The KPMG Bicycle Problem¶

Table of contents

  • Introduction
    • Background :
  • Task 1 : Data Quality Assessment
    • Data Quality Framework
    • Importing the data
    • Analyzing the dataframes
      • Analysing Transactions Data
        • Shape and Cardinality
        • Missing Values and Duplicate Rows
        • Date Time Adjustments
      • Fixing the Transactions Data
      • Analysing Customer Demographic Data
        • Shape and Cardinality
        • Null Values and Duplicate Rows
        • Date Time Adjustments
      • Fixing the Customer Demographic Data
      • Analysing Customer Address Data
        • Shape and Cardinality
        • Missing Values and Duplicate Rows
      • Fixing the Customer Address Data
      • Analysing New Customer List Data
        • Shape and Cardinality
        • Missing Values and Duplicate Rows
        • Date Time Adjustments
      • Fixing the New Customer List Data
  • Task 2 : Data Insights
    • Merging the Dataframes
      • Merging Strategy
    • Important Note
    • Building the Postcoded Dataframe
    • Feature Engineering
    • Univariate Analysis
      • Highlights from the Univariate Analysis
    • Bivariate Analysis
      • Highlights from the Bivariate Analysis
    • Multivariate Analysis
  • Task 2 : RFM Analysis
    • Creating the RFM Table
    • Finding the Optimal Number of Clusters
    • Classifying the Customers
    • Visualizing the Clusters
    • Mapping the Clusters to the Master Dataframe
  • Task 2 : Geospatial Analysis
    • Marker Cluster
    • Heatmap
    • Bubble Map
    • Analysing Coastline Proximity
  • Task 3 : Model Development
    • Data Preprocessing
    • Feature Engineering
      • Age and Age Group
      • Geo Location
      • Tenure and Property Valuation
      • Potential Customer ID
    • Building a Basic Logistic Regression Model
      • Feature Selection
      • Defining Categorical and Numerical Columns
      • Building the Pipeline
      • Scores and Predictions
    • Building a Better Model
      • Bulding the Pipeline
      • Applying RandomizedSearchCV on the Pipeline
      • Checking Accuracy of the Model
      • Visualizing the Best Parameters
  • End Notes
  • References

Introduction¶

A classic example of customer segmentaion analysis.

This is a part of the KPMG Job Simulation Program, it used to be availible the forage but is archived now.

This notebook is a summary and updated version of the past three notebooks I've created before which can be visited here:

  • Task 1
  • Task 2
  • Task 3

Background :¶

Apparently I’m finally working at KPMG (please hire me) and I just got my first client, it’s a medium-sized bikes and cycling accessories company, Sprocket Central Pty Ltd.

Me along with my Analytics, Information and modelling team are supposed to help them skyrocket their business 📈👌

They provided me with three datasets:

  • Customer Demographic
  • Customer Addresses
  • Transactions data in the past 3 months

But there are issues with their data, of course, it can not be used to build some magical model right away, also my Associate Director suggested that I should optimise the quality of the customer data before deriving any kind of insights for the company growth.

Which brings us towards our first task !

Task 1 : Data Quality Assessment¶

Fix the issues we encounter in all three datasets, and draft an email to the client identifying the data quality issues and ways to mitigate them.

At the end of this task, we would have it cleaned, made it accurate and consistent, and ready to be used for our analysis.

Data Quality Framework¶

Framework Description
Accuracy The closeness between a value to its correct representation of the real-life phenomenon
Completeness The extent to which data are of sufficient breadth, depth, and scope for the task at hand
Consistency The extent to which data are uniform in format, use, and meaning across a data collection
Currency The freshness of data
Volatility The length of time the data remains valid
Relevancy The extent to which data are appropriate for the task at hand
Validity The extent to which data conform to defined business rules or constraints
Uniqueness The extent to which data are unique within the dataset

Importing the data¶

We will be using pandas libraries to import the data and perform our analysis. You can also use excel or google sheets and whatever tools you're comfortable with.

In [ ]:
# Importing the libraries
import numpy as np
import pandas as pd

# Importing the dataset
xls = pd.ExcelFile(
    "/home/meow/Desktop/internship-speedrun/kpmg/KPMG_VI_New_raw_data_update_final.xlsx"
)

# I'm using black formatting for the code
# Keep note of the naming convention
# We used PascalCase to name all out original datasets

Transactions = pd.read_excel(xls, "Transactions", skiprows=1)
CustomerDemographic = pd.read_excel(xls, "CustomerDemographic", skiprows=1)
CustomerAddress = pd.read_excel(xls, "CustomerAddress", skiprows=1)
NewCustomerList = pd.read_excel(xls, "NewCustomerList", skiprows=1)

Analyzing the dataframes¶

We will analyze the dataframes and see what issues we can find in them.

  • Shape and Cardinality
  • Missing Values
  • Duplicates
  • Data Types

Solutions and mitigation strategies are available at the end of each section.

Analysing Transactions Data¶

Shape and Cardinality¶

  • Shape of the data is 20000 rows and 13 columns
  • Columns with cardinality 2 should be converted to boolean
  • This would be useful for non machine learning approaches
In [ ]:
Transactions.head()
Out[ ]:
transaction_id product_id customer_id transaction_date online_order order_status brand product_line product_class product_size list_price standard_cost product_first_sold_date
0 1 2 2950 2017-02-25 0.0 Approved Solex Standard medium medium 71.49 53.62 41245.0
1 2 3 3120 2017-05-21 1.0 Approved Trek Bicycles Standard medium large 2091.47 388.92 41701.0
2 3 37 402 2017-10-16 0.0 Approved OHM Cycles Standard low medium 1793.43 248.82 36361.0
3 4 88 3135 2017-08-31 0.0 Approved Norco Bicycles Standard medium medium 1198.46 381.10 36145.0
4 5 78 787 2017-10-01 1.0 Approved Giant Bicycles Standard medium large 1765.30 709.48 42226.0
In [ ]:
Transactions.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 13 columns):
 #   Column                   Non-Null Count  Dtype         
---  ------                   --------------  -----         
 0   transaction_id           20000 non-null  int64         
 1   product_id               20000 non-null  int64         
 2   customer_id              20000 non-null  int64         
 3   transaction_date         20000 non-null  datetime64[ns]
 4   online_order             19640 non-null  float64       
 5   order_status             20000 non-null  object        
 6   brand                    19803 non-null  object        
 7   product_line             19803 non-null  object        
 8   product_class            19803 non-null  object        
 9   product_size             19803 non-null  object        
 10  list_price               20000 non-null  float64       
 11  standard_cost            19803 non-null  float64       
 12  product_first_sold_date  19803 non-null  float64       
dtypes: datetime64[ns](1), float64(4), int64(3), object(5)
memory usage: 2.0+ MB
In [ ]:
# List unique values of all the columns with ascending order of cardinality
# Cardinality is the number of unique values in a column
# This is a good way to check for categorical columns

for col in Transactions.columns:
    print(f"{col}: {Transactions[col].nunique()}")

print("------------------------------------")

# Print top 7 unique values of all the columns with ascending order of cardinality if less than 7
for col in Transactions.columns:
    if Transactions[col].nunique() < 10:
        print(f"{col}: {Transactions[col].unique()}")
transaction_id: 20000
product_id: 101
customer_id: 3494
transaction_date: 364
online_order: 2
order_status: 2
brand: 6
product_line: 4
product_class: 3
product_size: 3
list_price: 296
standard_cost: 103
product_first_sold_date: 100
------------------------------------
online_order: [ 0.  1. nan]
order_status: ['Approved' 'Cancelled']
brand: ['Solex' 'Trek Bicycles' 'OHM Cycles' 'Norco Bicycles' 'Giant Bicycles'
 'WeareA2B' nan]
product_line: ['Standard' 'Road' 'Mountain' 'Touring' nan]
product_class: ['medium' 'low' 'high' nan]
product_size: ['medium' 'large' 'small' nan]

Missing Values and Duplicate Rows¶

  • The number of missing values in this dataframe are insignificant and from the same rows, we can drop them.
  • There are no duplicate rows
In [ ]:
# Function to analyse the missing data


def analyze_null_values(dataframe):
    total_null_values = dataframe.isnull().sum()
    total_values = dataframe.count().sort_values(ascending=True)
    null_values_percentage = (total_null_values / total_values) * 100
    missing_data = pd.concat(
        {
            "Null Values": total_null_values,
            "Percentage of Missing Values": null_values_percentage,
            "Total Values": total_values,
        },
        axis=1,
    )

    missing_data = missing_data.sort_values(
        by="Percentage of Missing Values", ascending=False
    )

    return missing_data


analyze_null_values(Transactions)
Out[ ]:
Null Values Percentage of Missing Values Total Values
online_order 360 1.832994 19640
brand 197 0.994799 19803
product_line 197 0.994799 19803
product_class 197 0.994799 19803
product_size 197 0.994799 19803
standard_cost 197 0.994799 19803
product_first_sold_date 197 0.994799 19803
transaction_id 0 0.000000 20000
product_id 0 0.000000 20000
customer_id 0 0.000000 20000
transaction_date 0 0.000000 20000
order_status 0 0.000000 20000
list_price 0 0.000000 20000
In [ ]:
# Show the number of duplicated rows
Transactions.duplicated().sum()
Out[ ]:
0

Date Time Adjustments¶

  • transaction_date should be converted to datetime format
  • product_first_sold_date should be converted to datetime format
  • we will later make a new product_age column by converting the product_first_sold_date to datetime format and subtracting it from the today_date
In [ ]:
# Define start and end date of the dataset
start_date = pd.to_datetime("2017-01-01")
end_date = pd.to_datetime("2017-12-31")

# Convert transaction_date column to standard datetime format
Transactions["transaction_date"] = pd.to_datetime(Transactions["transaction_date"])

# Convert product_first_sold_date column to standard datetime format
# We need to add the timedelta to the date because the date is stored as a number of days since 1900-01-01
# Assuming the dataset was given to us on 2018-01-01

today_date = pd.Timestamp("2018-01-01")
Transactions["product_first_sold_date"] = pd.to_timedelta(
    Transactions["product_first_sold_date"], unit="D"
) + pd.Timestamp("1900-01-01")

date_difference = pd.Timestamp.today() - today_date
Transactions["product_first_sold_date"] = (
    Transactions["product_first_sold_date"] - date_difference
)
In [ ]:
Transactions.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 13 columns):
 #   Column                   Non-Null Count  Dtype         
---  ------                   --------------  -----         
 0   transaction_id           20000 non-null  int64         
 1   product_id               20000 non-null  int64         
 2   customer_id              20000 non-null  int64         
 3   transaction_date         20000 non-null  datetime64[ns]
 4   online_order             19640 non-null  float64       
 5   order_status             20000 non-null  object        
 6   brand                    19803 non-null  object        
 7   product_line             19803 non-null  object        
 8   product_class            19803 non-null  object        
 9   product_size             19803 non-null  object        
 10  list_price               20000 non-null  float64       
 11  standard_cost            19803 non-null  float64       
 12  product_first_sold_date  19803 non-null  datetime64[ns]
dtypes: datetime64[ns](2), float64(3), int64(3), object(5)
memory usage: 2.0+ MB

Fixing the Transactions Data¶

  • We already fixed the date time format earlier
  • We will convert the online_order and order_status columns to object type and map them as True/False
  • We have their datatype as object, even though they are boolean, because it'll be helpful for machine learning models
  • Don't worry about the NaN values, pipelines will take care of them
In [ ]:
# Mapping the online_order column to True and False
Transactions["online_order"] = Transactions["online_order"].map({1: True, 0: False})

# Mapping the order_status column to True and False
Transactions["order_status"] = Transactions["order_status"].map(
    {"Approved": True, "Cancelled": False}
)

# Print unique values of online_order and order_status columns
print("Unique values of online_order column:", Transactions["online_order"].unique())
print("Unique values of order_status column:", Transactions["order_status"].unique())
Unique values of online_order column: [False True nan]
Unique values of order_status column: [ True False]

Analysing Customer Demographic Data¶

Shape and Cardinality¶

  • Shape of the data is 4000 rows and 13 columns
  • owns_car column should be converted to boolean
  • deceased_indicator column should be converted to boolean
  • gender column should be converted to boolean
  • default column should be dropped as it has no legible data
In [ ]:
CustomerDemographic.head(5)
Out[ ]:
customer_id first_name last_name gender past_3_years_bike_related_purchases DOB job_title job_industry_category wealth_segment deceased_indicator default owns_car tenure
0 1 Laraine Medendorp F 93 1953-10-12 00:00:00 Executive Secretary Health Mass Customer N "' Yes 11.0
1 2 Eli Bockman Male 81 1980-12-16 00:00:00 Administrative Officer Financial Services Mass Customer N <script>alert('hi')</script> Yes 16.0
2 3 Arlin Dearle Male 61 1954-01-20 00:00:00 Recruiting Manager Property Mass Customer N 2018-02-01 00:00:00 Yes 15.0
3 4 Talbot NaN Male 33 1961-10-03 00:00:00 NaN IT Mass Customer N () { _; } >_[$($())] { touch /tmp/blns.shellsh... No 7.0
4 5 Sheila-kathryn Calton Female 56 1977-05-13 00:00:00 Senior Editor NaN Affluent Customer N NIL Yes 8.0
In [ ]:
CustomerDemographic.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4000 entries, 0 to 3999
Data columns (total 13 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   customer_id                          4000 non-null   int64  
 1   first_name                           4000 non-null   object 
 2   last_name                            3875 non-null   object 
 3   gender                               4000 non-null   object 
 4   past_3_years_bike_related_purchases  4000 non-null   int64  
 5   DOB                                  3913 non-null   object 
 6   job_title                            3494 non-null   object 
 7   job_industry_category                3344 non-null   object 
 8   wealth_segment                       4000 non-null   object 
 9   deceased_indicator                   4000 non-null   object 
 10  default                              3698 non-null   object 
 11  owns_car                             4000 non-null   object 
 12  tenure                               3913 non-null   float64
dtypes: float64(1), int64(2), object(10)
memory usage: 406.4+ KB
In [ ]:
# List unique values of all the columns with ascending order of cardinality
# Cardinality is the number of unique values in a column
# This is a good way to check for categorical columns

for col in CustomerDemographic.columns:
    print(f"{col}: {CustomerDemographic[col].nunique()}")

print("------------------------------------")

# Print top 7 unique values of all the columns with ascending order of cardinality if less than 7
for col in CustomerDemographic.columns:
    if CustomerDemographic[col].nunique() < 10:
        print(f"{col}: {CustomerDemographic[col].unique()}")
customer_id: 4000
first_name: 3139
last_name: 3725
gender: 6
past_3_years_bike_related_purchases: 100
DOB: 3448
job_title: 195
job_industry_category: 9
wealth_segment: 3
deceased_indicator: 2
default: 90
owns_car: 2
tenure: 22
------------------------------------
gender: ['F' 'Male' 'Female' 'U' 'Femal' 'M']
job_industry_category: ['Health' 'Financial Services' 'Property' 'IT' nan 'Retail' 'Argiculture'
 'Manufacturing' 'Telecommunications' 'Entertainment']
wealth_segment: ['Mass Customer' 'Affluent Customer' 'High Net Worth']
deceased_indicator: ['N' 'Y']
owns_car: ['Yes' 'No']

Null Values and Duplicate Rows¶

  • job_title has 506 missing values
  • job_industry_category has 656 missing values
  • last_name has 125 missing values this whole column can be dropped since it wont be useful for our analysis
  • default has 3027 missing values this whole column can be dropped since it wont be useful for our analysis
  • tenure has 87 missing values
  • DOB has 87 missing values
  • There are no duplicate rows
In [ ]:
# Call the function to analyze the missing data

analyze_null_values(CustomerDemographic)
Out[ ]:
Null Values Percentage of Missing Values Total Values
job_industry_category 656 19.617225 3344
job_title 506 14.481969 3494
default 302 8.166577 3698
last_name 125 3.225806 3875
DOB 87 2.223358 3913
tenure 87 2.223358 3913
customer_id 0 0.000000 4000
first_name 0 0.000000 4000
gender 0 0.000000 4000
past_3_years_bike_related_purchases 0 0.000000 4000
wealth_segment 0 0.000000 4000
deceased_indicator 0 0.000000 4000
owns_car 0 0.000000 4000
In [ ]:
# Show the number of duplicated rows
CustomerDemographic.duplicated().sum()
Out[ ]:
0

Date Time Adjustments¶

  • DOB should be converted to datetime format
  • tenure should be converted to integer format
In [ ]:
# Convert DOB column to standard datetime format
CustomerDemographic["DOB"] = pd.to_datetime(CustomerDemographic["DOB"])
CustomerDemographic["tenure"] = CustomerDemographic["tenure"].astype("Int64")

Fixing the Customer Demographic Data¶

  • We already fixed the DOB and tenure columns earlier
  • We will convert the owns_car and deceased_indicator columns to object type and map them as True/False
  • We have their datatype as object, even though they are boolean, because it'll be helpful for machine learning models
  • We will map the gender column to M and F and make it consistent
In [ ]:
# Converting own_car column to True and False
CustomerDemographic["owns_car"] = CustomerDemographic["owns_car"].map(
    {"Yes": True, "No": False}
)

CustomerDemographic["owns_car"] = CustomerDemographic["owns_car"].astype("object")

# Converting deceased_indicator column to True and False
CustomerDemographic["deceased_indicator"] = CustomerDemographic[
    "deceased_indicator"
].map({"Y": True, "N": False})

CustomerDemographic['deceased_indicator'] = CustomerDemographic['deceased_indicator'].astype('object')

# Converting Gender column to M and F and U with np.nan
CustomerDemographic["gender"] = CustomerDemographic["gender"].replace("Female", "F")
CustomerDemographic["gender"] = CustomerDemographic["gender"].replace("Femal", "F")
CustomerDemographic["gender"] = CustomerDemographic["gender"].replace("Male", "M")
CustomerDemographic["gender"] = CustomerDemographic["gender"].replace("U", np.nan)

# Drop the default column
CustomerDemographic.drop(columns="default", inplace=True)

Analysing Customer Address Data¶

Shape and Cardinality¶

  • Shape of the data is 3999 rows and 6 columns
  • state column needs to be mapped to State Initials
In [ ]:
CustomerAddress.head(5)
Out[ ]:
customer_id address postcode state country property_valuation
0 1 060 Morning Avenue 2016 New South Wales Australia 10
1 2 6 Meadow Vale Court 2153 New South Wales Australia 10
2 4 0 Holy Cross Court 4211 QLD Australia 9
3 5 17979 Del Mar Point 2448 New South Wales Australia 4
4 6 9 Oakridge Court 3216 VIC Australia 9
In [ ]:
CustomerAddress.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3999 entries, 0 to 3998
Data columns (total 6 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   customer_id         3999 non-null   int64 
 1   address             3999 non-null   object
 2   postcode            3999 non-null   int64 
 3   state               3999 non-null   object
 4   country             3999 non-null   object
 5   property_valuation  3999 non-null   int64 
dtypes: int64(3), object(3)
memory usage: 187.6+ KB
In [ ]:
# Cardinailty of the columns
for col in CustomerAddress.columns:
    print(f"{col}: {CustomerAddress[col].nunique()}")
print("------------------------------------")

# Print top 7 unique values of all the columns with ascending order of cardinality if less than 7
for col in CustomerAddress.columns:
    if CustomerAddress[col].nunique() < 10:
        print(f"{col}: {CustomerAddress[col].unique()}")
customer_id: 3999
address: 3996
postcode: 873
state: 5
country: 1
property_valuation: 12
------------------------------------
state: ['New South Wales' 'QLD' 'VIC' 'NSW' 'Victoria']
country: ['Australia']

Missing Values and Duplicate Rows¶

  • no missing values
  • no duplicate rows
In [ ]:
analyze_null_values(CustomerAddress)
Out[ ]:
Null Values Percentage of Missing Values Total Values
customer_id 0 0.0 3999
address 0 0.0 3999
postcode 0 0.0 3999
state 0 0.0 3999
country 0 0.0 3999
property_valuation 0 0.0 3999
In [ ]:
# Print the number of duplicated rows
CustomerAddress.duplicated().sum()
Out[ ]:
0

Fixing the Customer Address Data¶

  • There's no date time adjustments to be made
  • We will map the state column to State Initials
In [ ]:
# Mapping New South Wales to NSW and Victoria to VIC
CustomerAddress["state"] = CustomerAddress["state"].replace("New South Wales", "NSW")
CustomerAddress["state"] = CustomerAddress["state"].replace("Victoria", "VIC")

Analysing New Customer List Data¶

Shape and Cardinality¶

  • Shape of the data is 1000 rows and 23 columns
  • owns_car column should be converted to boolean
  • deceased_indicator column should be converted to boolean
  • gender column should be converted to boolean
  • unnamed columns should be dropped as it has no legible data
In [ ]:
# Set max column display to None
pd.set_option("display.max_columns", None)

NewCustomerList.sample(5)
Out[ ]:
first_name last_name gender past_3_years_bike_related_purchases DOB job_title job_industry_category wealth_segment deceased_indicator owns_car tenure address postcode state country property_valuation Unnamed: 16 Unnamed: 17 Unnamed: 18 Unnamed: 19 Unnamed: 20 Rank Value
505 Milty Brauninger Male 13 1945-07-10 Payment Adjustment Coordinator NaN High Net Worth N No 11 54 6th Trail 2640 NSW Australia 7 0.69 0.69 0.690 0.69000 504 504 0.850000
647 Aldin Newsome Male 24 1981-07-06 Financial Analyst Financial Services Mass Customer N No 17 058 Morningstar Center 2127 NSW Australia 9 0.90 0.90 1.125 0.95625 648 648 0.725000
216 Rosalinde Cubuzzi Female 50 1997-09-04 Business Systems Development Analyst NaN Mass Customer N No 5 6 Lotheville Trail 2444 NSW Australia 7 0.75 0.75 0.750 0.63750 215 215 1.128906
235 Rolland Esmead Male 41 1940-07-13 NaN Health Affluent Customer N No 8 72008 7th Avenue 2200 NSW Australia 8 0.93 0.93 0.930 0.93000 233 233 1.100000
342 Tannie Petrakov Male 84 1951-11-27 Data Coordiator IT Affluent Customer N No 10 691 Valley Edge Alley 4078 QLD Australia 6 0.46 0.46 0.575 0.57500 341 341 0.990000
In [ ]:
NewCustomerList.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 23 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   first_name                           1000 non-null   object 
 1   last_name                            971 non-null    object 
 2   gender                               1000 non-null   object 
 3   past_3_years_bike_related_purchases  1000 non-null   int64  
 4   DOB                                  983 non-null    object 
 5   job_title                            894 non-null    object 
 6   job_industry_category                835 non-null    object 
 7   wealth_segment                       1000 non-null   object 
 8   deceased_indicator                   1000 non-null   object 
 9   owns_car                             1000 non-null   object 
 10  tenure                               1000 non-null   int64  
 11  address                              1000 non-null   object 
 12  postcode                             1000 non-null   int64  
 13  state                                1000 non-null   object 
 14  country                              1000 non-null   object 
 15  property_valuation                   1000 non-null   int64  
 16  Unnamed: 16                          1000 non-null   float64
 17  Unnamed: 17                          1000 non-null   float64
 18  Unnamed: 18                          1000 non-null   float64
 19  Unnamed: 19                          1000 non-null   float64
 20  Unnamed: 20                          1000 non-null   int64  
 21  Rank                                 1000 non-null   int64  
 22  Value                                1000 non-null   float64
dtypes: float64(5), int64(6), object(12)
memory usage: 179.8+ KB
In [ ]:
# Cardinality of the columns
for col in NewCustomerList.columns:
    print(f"{col}: {NewCustomerList[col].nunique()}")
print("------------------------------------")

# Print top 7 unique values of all the columns with ascending order of cardinality if less than 7
for col in NewCustomerList.columns:
    if NewCustomerList[col].nunique() < 10:
        print(f"{col}: {NewCustomerList[col].unique()}")
first_name: 940
last_name: 961
gender: 3
past_3_years_bike_related_purchases: 100
DOB: 961
job_title: 184
job_industry_category: 9
wealth_segment: 3
deceased_indicator: 1
owns_car: 2
tenure: 23
address: 1000
postcode: 522
state: 3
country: 1
property_valuation: 12
Unnamed: 16: 71
Unnamed: 17: 132
Unnamed: 18: 183
Unnamed: 19: 321
Unnamed: 20: 324
Rank: 324
Value: 324
------------------------------------
gender: ['Male' 'Female' 'U']
job_industry_category: ['Manufacturing' 'Property' 'Financial Services' 'Entertainment' 'Retail'
 'IT' 'Telecommunications' 'Health' nan 'Argiculture']
wealth_segment: ['Mass Customer' 'Affluent Customer' 'High Net Worth']
deceased_indicator: ['N']
owns_car: ['Yes' 'No']
state: ['QLD' 'NSW' 'VIC']
country: ['Australia']

Missing Values and Duplicate Rows¶

  • job_title has 106 missing values
  • job_industry_category has 165 missing values
  • last_name has 29 missing values this whole column can be dropped since it wont be useful for our analysis
  • there are no duplicate rows
In [ ]:
# Call the function to analyze the missing data
analyze_null_values(NewCustomerList)
Out[ ]:
Null Values Percentage of Missing Values Total Values
job_industry_category 165 19.760479 835
job_title 106 11.856823 894
last_name 29 2.986612 971
DOB 17 1.729400 983
first_name 0 0.000000 1000
country 0 0.000000 1000
Rank 0 0.000000 1000
Unnamed: 20 0 0.000000 1000
Unnamed: 19 0 0.000000 1000
Unnamed: 18 0 0.000000 1000
Unnamed: 17 0 0.000000 1000
Unnamed: 16 0 0.000000 1000
property_valuation 0 0.000000 1000
address 0 0.000000 1000
state 0 0.000000 1000
postcode 0 0.000000 1000
tenure 0 0.000000 1000
owns_car 0 0.000000 1000
deceased_indicator 0 0.000000 1000
wealth_segment 0 0.000000 1000
past_3_years_bike_related_purchases 0 0.000000 1000
gender 0 0.000000 1000
Value 0 0.000000 1000
In [ ]:
# Check for duplicated rows
NewCustomerList.duplicated().sum()
Out[ ]:
0

Date Time Adjustments¶

  • DOB should be converted to datetime format
In [ ]:
# Convert DOB column to standard datetime format
NewCustomerList["DOB"] = pd.to_datetime(NewCustomerList["DOB"])

Fixing the New Customer List Data¶

  • We already fixed the DOB column earlier
  • We will convert the owns_car and deceased_indicator columns to object type and map them as True/False
  • We will map the gender column to M and F and make it consistent
  • We will drop the unnamed columns
In [ ]:
# Fixing own_car column
NewCustomerList["owns_car"] = NewCustomerList["owns_car"].replace("Yes", True)
NewCustomerList["owns_car"] = NewCustomerList["owns_car"].replace("No", False)
NewCustomerList["owns_car"] = NewCustomerList["owns_car"].astype("object")

# Fixing deceased_indicator column
NewCustomerList["deceased_indicator"] = NewCustomerList["deceased_indicator"].replace("N", False)
NewCustomerList["deceased_indicator"] = NewCustomerList["deceased_indicator"].astype("object")

# Fixing gender column
NewCustomerList["gender"] = NewCustomerList["gender"].replace("Female", "F")
NewCustomerList["gender"] = NewCustomerList["gender"].replace("Male", "M")
NewCustomerList["gender"] = NewCustomerList["gender"].replace("U", np.nan)
In [ ]:
# Dropping all unnamed columns
NewCustomerList.drop(columns="Unnamed: 16", inplace=True)
NewCustomerList.drop(columns="Unnamed: 17", inplace=True)
NewCustomerList.drop(columns="Unnamed: 18", inplace=True)
NewCustomerList.drop(columns="Unnamed: 19", inplace=True)
NewCustomerList.drop(columns="Unnamed: 20", inplace=True)
In [ ]:
NewCustomerList.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 18 columns):
 #   Column                               Non-Null Count  Dtype         
---  ------                               --------------  -----         
 0   first_name                           1000 non-null   object        
 1   last_name                            971 non-null    object        
 2   gender                               983 non-null    object        
 3   past_3_years_bike_related_purchases  1000 non-null   int64         
 4   DOB                                  983 non-null    datetime64[ns]
 5   job_title                            894 non-null    object        
 6   job_industry_category                835 non-null    object        
 7   wealth_segment                       1000 non-null   object        
 8   deceased_indicator                   1000 non-null   object        
 9   owns_car                             1000 non-null   object        
 10  tenure                               1000 non-null   int64         
 11  address                              1000 non-null   object        
 12  postcode                             1000 non-null   int64         
 13  state                                1000 non-null   object        
 14  country                              1000 non-null   object        
 15  property_valuation                   1000 non-null   int64         
 16  Rank                                 1000 non-null   int64         
 17  Value                                1000 non-null   float64       
dtypes: datetime64[ns](1), float64(1), int64(5), object(11)
memory usage: 140.8+ KB

Task 2 : Data Insights¶

Now that we have proposed a solution to the data quality issues, we can now use the data to derive insights.

  • We will merge the dataframes to make a master dataframe
  • We will find something to do with the postal codes and adresses
  • We will do a RFM analysis and make customer clusters
  • We will use geopandas to draw customer location heatmaps
  • We will use univariate and multivariate analysis to find insights

Merging the Dataframes¶

We will merge the three dataframes to make a master dataframe, we will use the customer_id column as the key.

Merging Strategy¶

  • Looks like customer_id is the common column in all three traingin dataframes
  • NewCustomerList has no customer_id column since they are new customers
  • We will use customer_id as the primary key to merge the dataframes
  • Since we don't want to deal with duplicate columns, we will use inner join
In [ ]:
# Display all the common columns in all the datasets
common_columns = list(
    set(Transactions.columns)
    & set(CustomerDemographic.columns)
    & set(CustomerAddress.columns)
)
print(common_columns)
['customer_id']
In [ ]:
# Inner join all the datasets
_ = pd.merge(
    Transactions,
    CustomerDemographic,
    how="inner",
    left_on="customer_id",
    right_on="customer_id",
)

MasterDataset = pd.merge(
    _,
    CustomerAddress,
    how="inner",
    left_on="customer_id",
    right_on="customer_id",
)

MasterDataset.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19968 entries, 0 to 19967
Data columns (total 29 columns):
 #   Column                               Non-Null Count  Dtype         
---  ------                               --------------  -----         
 0   transaction_id                       19968 non-null  int64         
 1   product_id                           19968 non-null  int64         
 2   customer_id                          19968 non-null  int64         
 3   transaction_date                     19968 non-null  datetime64[ns]
 4   online_order                         19609 non-null  object        
 5   order_status                         19968 non-null  bool          
 6   brand                                19773 non-null  object        
 7   product_line                         19773 non-null  object        
 8   product_class                        19773 non-null  object        
 9   product_size                         19773 non-null  object        
 10  list_price                           19968 non-null  float64       
 11  standard_cost                        19773 non-null  float64       
 12  product_first_sold_date              19773 non-null  datetime64[ns]
 13  first_name                           19968 non-null  object        
 14  last_name                            19326 non-null  object        
 15  gender                               19513 non-null  object        
 16  past_3_years_bike_related_purchases  19968 non-null  int64         
 17  DOB                                  19522 non-null  datetime64[ns]
 18  job_title                            17589 non-null  object        
 19  job_industry_category                16746 non-null  object        
 20  wealth_segment                       19968 non-null  object        
 21  deceased_indicator                   19968 non-null  object        
 22  owns_car                             19968 non-null  object        
 23  tenure                               19522 non-null  Int64         
 24  address                              19968 non-null  object        
 25  postcode                             19968 non-null  int64         
 26  state                                19968 non-null  object        
 27  country                              19968 non-null  object        
 28  property_valuation                   19968 non-null  int64         
dtypes: Int64(1), bool(1), datetime64[ns](3), float64(2), int64(6), object(16)
memory usage: 4.3+ MB

Important Note¶

Int64 and int64 are different datatypes.

I recieved a peculiar error with the same code when I tried to use Int64 instead of int64, turns out some seaborne functions can not handle non nullable data types.

'Int64': 'Int64' is a nullable integer data type introduced in pandas 0.24.0. It allows for the representation of integer data with the presence of NaN values. This means that if you have missing values (NaN) in your integer column, pandas will use the 'Int64' data type to indicate that it supports nullable integers.

'int64': 'int64' (lowercase 'i') is the regular integer data type (non-nullable) provided by NumPy, which is the underlying library used by pandas. In 'int64' data type, there are no NaN values allowed, and the data is treated as non-nullable integers.

So we will convert all the int64 columns to Int64 to avoid any errors.

In [ ]:
# Check if the datatype is int64 if it is, convert it to Int64

for col in MasterDataset.columns:
    if MasterDataset[col].dtype == "int64":
        MasterDataset[col] = MasterDataset[col].astype("Int64")
        
MasterDataset.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19968 entries, 0 to 19967
Data columns (total 29 columns):
 #   Column                               Non-Null Count  Dtype         
---  ------                               --------------  -----         
 0   transaction_id                       19968 non-null  Int64         
 1   product_id                           19968 non-null  Int64         
 2   customer_id                          19968 non-null  Int64         
 3   transaction_date                     19968 non-null  datetime64[ns]
 4   online_order                         19609 non-null  object        
 5   order_status                         19968 non-null  bool          
 6   brand                                19773 non-null  object        
 7   product_line                         19773 non-null  object        
 8   product_class                        19773 non-null  object        
 9   product_size                         19773 non-null  object        
 10  list_price                           19968 non-null  float64       
 11  standard_cost                        19773 non-null  float64       
 12  product_first_sold_date              19773 non-null  datetime64[ns]
 13  first_name                           19968 non-null  object        
 14  last_name                            19326 non-null  object        
 15  gender                               19513 non-null  object        
 16  past_3_years_bike_related_purchases  19968 non-null  Int64         
 17  DOB                                  19522 non-null  datetime64[ns]
 18  job_title                            17589 non-null  object        
 19  job_industry_category                16746 non-null  object        
 20  wealth_segment                       19968 non-null  object        
 21  deceased_indicator                   19968 non-null  object        
 22  owns_car                             19968 non-null  object        
 23  tenure                               19522 non-null  Int64         
 24  address                              19968 non-null  object        
 25  postcode                             19968 non-null  Int64         
 26  state                                19968 non-null  object        
 27  country                              19968 non-null  object        
 28  property_valuation                   19968 non-null  Int64         
dtypes: Int64(7), bool(1), datetime64[ns](3), float64(2), object(16)
memory usage: 4.4+ MB

Building the Postcoded Dataframe¶

  • We would have used Australian Bureau of Statistics to get the postal codes and their corresponding states, but they don't have a free API
  • So we will use a CSV file instead
  • Made a Postcoded dataframe with the postal codes and fixed the state column, it is now consistent with the postal codes
  • We will later use this with Geopandas to make a heatmap of the customer locations
In [ ]:
# Import the Australian Postcodes dataset
Postcodes = pd.read_csv(
    "/home/meow/Desktop/internship-speedrun/kpmg/australian_postcodes.csv"
)

# Replace 0 with np.nan
Postcodes.replace(0, np.nan, inplace=True)
In [ ]:
# We only need the postcode, longitude and latitude columns
# We also need the state column which will make the state consistent with postcodes in the MasterDataset

Postcodes = Postcodes[["postcode", "lat", "long", "state"]]
In [ ]:
# Drop all the rows with missing values
Postcodes.dropna(inplace=True)

# Drop all the duplicate rows
Postcodes.drop_duplicates(inplace=True)

# Drop all the non unique postcode rows
Postcodes.drop_duplicates(subset="postcode", inplace=True)

Postcodes.head()
Out[ ]:
postcode lat long state
0 200 -35.277700 149.119000 ACT
2 800 -12.458684 130.836680 NT
4 801 -12.458684 130.836680 NT
6 804 -12.428017 130.873315 NT
7 810 -12.381806 130.866242 NT
In [ ]:
Postcoded = MasterDataset.copy()

Postcoded = pd.merge(
    Postcoded,
    Postcodes,
    how="inner",
    left_on="postcode",
    right_on="postcode",
)
In [ ]:
print(Postcoded.shape)
print(MasterDataset.shape)
(19968, 32)
(19968, 29)
In [ ]:
# Since state_y is a feature of postcode it is more reliable than state_x
# So we will drop state_x

Postcoded.drop(columns="state_x", inplace=True)
Postcoded.rename(columns={"state_y": "state"}, inplace=True)
In [ ]:
Postcoded.sample(5)
Out[ ]:
transaction_id product_id customer_id transaction_date online_order order_status brand product_line product_class product_size list_price standard_cost product_first_sold_date first_name last_name gender past_3_years_bike_related_purchases DOB job_title job_industry_category wealth_segment deceased_indicator owns_car tenure address postcode country property_valuation lat long state
6125 8823 42 2540 2017-10-28 True True OHM Cycles Road medium small 1810.00 1610.90 2005-05-20 09:34:07.879592 Donavon NaN NaN 14 NaT Executive Secretary IT Mass Customer False True <NA> 49 Sheridan Circle 2560 Australia 7 -34.194216 150.768408 NSW
9055 15333 7 1338 2017-05-13 False True Giant Bicycles Standard medium small 1311.44 1167.18 1986-10-23 09:34:07.879592 Giuseppe Tezure M 53 1980-11-24 Product Engineer NaN High Net Worth False False 14 02885 Buhler Court 3804 Australia 9 -37.980059 145.325994 VIC
19644 14493 83 2408 2017-08-13 False True Solex Touring medium large 2083.94 675.03 2007-09-28 09:34:07.879592 Celestia NaN F 38 1955-10-10 VP Marketing Health Mass Customer False False 16 9 Hollow Ridge Circle 2500 Australia 8 -34.423171 150.874294 NSW
18464 20000 11 1144 2017-09-22 True True Trek Bicycles Standard medium small 1775.81 1580.47 1993-07-04 09:34:07.879592 Damon Phelip M 73 1996-12-12 Environmental Specialist Health High Net Worth False True 5 4439 7th Terrace 2177 Australia 9 -33.895053 150.879379 NSW
14983 19410 79 1771 2017-01-15 False False Norco Bicycles Standard medium medium 1555.58 818.01 2009-08-21 09:34:07.879592 Zarah Santello F 1 1974-11-01 Geological Engineer Manufacturing Mass Customer False True 14 410 Hoepker Pass 4123 Australia 5 -27.583528 153.116151 QLD

Feature Engineering¶

For deriving insights, we will make a few new columns in our master dataframe. These will be :

  • customer_age : Age of the customer (only keep customers with age less than 100)
  • customer_age_group : Age group of the customer
  • product_age : Age of the product
  • profit : Profit of the transaction
  • recency : Recency of the transaction
  • frequency : Frequency of the transaction
  • transaction_month : Month of the transaction
  • transaction_week : Week of the transaction
  • transaction_day : Day of the transaction
In [ ]:
# Calculate the age of the customers
Postcoded["customer_age"] = (end_date - Postcoded["DOB"]).dt.days // 365.25
Postcoded["customer_age"] = Postcoded["customer_age"].astype("Int64")

# Keeping only the values between 14 and 100
Postcoded = Postcoded[
    (Postcoded["customer_age"] >= 14) & (Postcoded["customer_age"] <= 100)
]

Postcoded.customer_age.describe()
Out[ ]:
count      19513.0
mean     39.855122
std       12.59898
min           15.0
25%           30.0
50%           40.0
75%           49.0
max           86.0
Name: customer_age, dtype: Float64
In [ ]:
# Calculate the age of the products
Postcoded["product_age"] = (end_date - Postcoded["product_first_sold_date"]).dt.days // 365.25
In [ ]:
# Calculate age group
age_group = pd.cut(
    Postcoded["customer_age"],
    bins=[14, 30, 50, 70, 90],
    labels=["14-30", "31-50", "51-70", "71-90"],
)
Postcoded["age_group"] = age_group
In [ ]:
# Calculate profit
Postcoded["profit"] = Postcoded["list_price"] - Postcoded["standard_cost"]

# Calculate recency
Postcoded["recency"] = (today_date - Postcoded["transaction_date"]).dt.days.astype("Int64")

# Calculate frequency

Postcoded["frequency"] = Postcoded.groupby("customer_id")["customer_id"].transform(
    "count"
)
In [ ]:
# Calculate the Transaction Day, Week , Month
Postcoded['transaction_day'] = Postcoded['transaction_date'].dt.strftime('%A')
Postcoded['transaction_week'] = Postcoded['transaction_date'].dt.strftime('%W')
Postcoded['transaction_month'] = Postcoded['transaction_date'].dt.strftime('%B')
In [ ]:
Postcoded.info()
<class 'pandas.core.frame.DataFrame'>
Index: 19513 entries, 0 to 19967
Data columns (total 40 columns):
 #   Column                               Non-Null Count  Dtype         
---  ------                               --------------  -----         
 0   transaction_id                       19513 non-null  Int64         
 1   product_id                           19513 non-null  Int64         
 2   customer_id                          19513 non-null  Int64         
 3   transaction_date                     19513 non-null  datetime64[ns]
 4   online_order                         19162 non-null  object        
 5   order_status                         19513 non-null  bool          
 6   brand                                19327 non-null  object        
 7   product_line                         19327 non-null  object        
 8   product_class                        19327 non-null  object        
 9   product_size                         19327 non-null  object        
 10  list_price                           19513 non-null  float64       
 11  standard_cost                        19327 non-null  float64       
 12  product_first_sold_date              19327 non-null  datetime64[ns]
 13  first_name                           19513 non-null  object        
 14  last_name                            18887 non-null  object        
 15  gender                               19513 non-null  object        
 16  past_3_years_bike_related_purchases  19513 non-null  Int64         
 17  DOB                                  19513 non-null  datetime64[ns]
 18  job_title                            17173 non-null  object        
 19  job_industry_category                16291 non-null  object        
 20  wealth_segment                       19513 non-null  object        
 21  deceased_indicator                   19513 non-null  object        
 22  owns_car                             19513 non-null  object        
 23  tenure                               19513 non-null  Int64         
 24  address                              19513 non-null  object        
 25  postcode                             19513 non-null  Int64         
 26  country                              19513 non-null  object        
 27  property_valuation                   19513 non-null  Int64         
 28  lat                                  19513 non-null  float64       
 29  long                                 19513 non-null  float64       
 30  state                                19513 non-null  object        
 31  customer_age                         19513 non-null  Int64         
 32  product_age                          19327 non-null  float64       
 33  age_group                            19513 non-null  category      
 34  profit                               19327 non-null  float64       
 35  recency                              19513 non-null  Int64         
 36  frequency                            19513 non-null  Int64         
 37  transaction_day                      19513 non-null  object        
 38  transaction_week                     19513 non-null  object        
 39  transaction_month                    19513 non-null  object        
dtypes: Int64(10), bool(1), category(1), datetime64[ns](3), float64(6), object(19)
memory usage: 6.0+ MB

Univariate Analysis¶

We will use seaborne to make some plots and find insights as done here.

If we find any outliers, we will take a deeper look to understand why they exist and how they affect the distribution.

Since this is a Univariate analysis we have no use for the null values, so we will drop them. We will make a copy of the Postcoded dataframe as MasterClean and drop the null values from it.

Highlights from the Univariate Analysis¶

  • From the Price Distribution, it's clear that it is a normally distributed data with no outliers, and a mean of 1110 USD
  • The Profit Distribution is Right Skewed, with a mean of 550 USD
  • The Age is Distributed between 20 and 60, with most of the customers being around the age of 30 and 45, makes sense for them to invest in a bicycle.
  • The Past 3 Year Bicycle purchase is also normally distributed, with a mean of 49 Purchases
  • The Age Segment that buys the most is 31-50, they also bring out the most profit
  • Most Purchases have been made on Mondays and Wednesdays, of course buying a bicycle will be the least of my concerns on a weekend.
  • August and October are the months with the most sales,
  • Most of the sales are made to the Mass Customer segment, accounting for 50% of the sales.
  • Most of the sales are made in the NSW region, accounting for 50% of the sales.
  • Women seem to be getting more sales by just a percent
  • The industry to be targeted is Manufacturing, then Financial Services and Health
In [ ]:
# Converting the Postcoded dataset to a new MasterClean dataset
MasterClean = Postcoded.copy()

# Dropping all the rows with missing values
MasterClean.dropna(inplace=True)
In [ ]:
# Product Distribution by Prices

import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style('whitegrid')

plt.figure(figsize=(10, 6))
sns.histplot(data=MasterClean, x='list_price')
plt.title('Product Distribution by Prices')
plt.xlabel('Price in USD')
plt.ylabel('Count')
plt.axvline(MasterClean['list_price'].mean(), color='red', linestyle='--', label='Mean')
plt.show()

print('Mean: ', MasterClean['list_price'].mean())
No description has been provided for this image
Mean:  1110.1365079248606
In [ ]:
# Product Distribution by Profit

import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style('whitegrid')

plt.figure(figsize=(10, 6))
sns.histplot(data=MasterClean, x='profit')
plt.title('Product Distribution by Profit')
plt.xlabel('Price in USD')
plt.ylabel('Count')
plt.axvline(MasterClean['profit'].mean(), color='red', linestyle='--', label='Mean')
plt.axvline(MasterClean['profit'].median(), color='green', linestyle='--', label='Median')
plt.show()

print('Mean: ', MasterClean['profit'].mean())
print('Median: ', MasterClean['profit'].median())
No description has been provided for this image
Mean:  552.2205679483417
Median:  445.2099999999999
In [ ]:
# Product Distribution by age

import matplotlib.pyplot as plt
import seaborn as sns

# Convert customer_age to int64 instead of Int64
MasterClean['customer_age'] = MasterClean['customer_age'].astype('int64')

sns.set_style('whitegrid')

plt.figure(figsize=(10, 6))
sns.histplot(data=MasterClean, x='customer_age')
plt.title('Product Distribution by Age')
plt.xlabel('Age in Years')
plt.ylabel('Count')
plt.axvline(MasterClean['customer_age'].mean(), color='red', linestyle='--', label='Mean')
plt.show()

print('Mean: ', MasterClean['customer_age'].mean())
No description has been provided for this image
Mean:  39.8009245670678
In [ ]:
# Product Distribution by age

import matplotlib.pyplot as plt
import seaborn as sns

# Convert customer_age to int64 instead of Int64
MasterClean['past_3_years_bike_related_purchases'] = MasterClean['past_3_years_bike_related_purchases'].astype('int64')

sns.set_style('whitegrid')

plt.figure(figsize=(10, 6))
sns.histplot(data=MasterClean, x='past_3_years_bike_related_purchases')
plt.title('Distribution of Past 3 Years Bike Related Purchases')
plt.xlabel('Purchases')
plt.ylabel('Count')
plt.axvline(MasterClean['past_3_years_bike_related_purchases'].mean(), color='red', linestyle='--', label='Mean')
plt.show()

print('Mean: ', MasterClean['past_3_years_bike_related_purchases'].mean())
No description has been provided for this image
Mean:  49.495450542999706
In [ ]:
# Order Distribution by Age Group

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10, 6))
base_color = sns.color_palette()[0]

counts = MasterClean['age_group'].value_counts(normalize=True)
sns.barplot(x= counts.index, y=counts.values, color=base_color)
plt.xlabel('Age Groups')
plt.ylabel('Percentage')
print(counts * 100)

plt.title('Order Distribution by Age Group')
plt.show()
age_group
31-50    52.355445
14-30    25.476959
51-70    21.969475
71-90     0.198122
Name: proportion, dtype: float64
No description has been provided for this image
In [ ]:
# Profit Distribution by Age Group

import matplotlib.pyplot as plt 
import seaborn as sns

plt.figure(figsize=(10, 6))
base_color = sns.color_palette()[0]
profit_by_age = MasterClean.groupby('age_group')['profit'].sum().reset_index()
sns.barplot(x='age_group', y='profit', data=profit_by_age, color=base_color)

# Set labels and title
plt.xlabel('Age Group')
plt.ylabel('Profit in Millions USD')
plt.title('Profit by Age Group')

print(profit_by_age.head())
/tmp/ipykernel_491/3743981089.py:8: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  profit_by_age = MasterClean.groupby('age_group')['profit'].sum().reset_index()
  age_group      profit
0     14-30  1916582.54
1     31-50  3944529.54
2     51-70  1648972.98
3     71-90    15576.84
No description has been provided for this image
In [ ]:
# Set the size of the figure
plt.figure(figsize=(12, 6))

# Bar plot for transaction day
plt.subplot(1, 2, 1)  # Subplot 1
sns.countplot(data=MasterClean, x='transaction_day', color=base_color)
plt.title('Number of Sales per Transaction Day')
plt.xticks(rotation=90)
plt.xlabel('Transaction Day')
plt.ylabel('Number of Sales')

# Bar plot for transaction month
plt.subplot(1, 2, 2)  # Subplot 2
sns.countplot(data=MasterClean, x='transaction_month', color=base_color)
plt.title('Number of Sales per Transaction Month')
plt.xlabel('Transaction Month')
plt.ylabel('Number of Sales')
plt.xticks(rotation=90)

# Adjust layout
plt.tight_layout()

# Show the plots
plt.show()
No description has been provided for this image
In [ ]:
#Orders distribution by wealth segment
plt.figure(figsize = (7,4), dpi = 100)

counts = MasterClean['wealth_segment'].value_counts(normalize = True)
sns.barplot(x = counts.index, y = counts.values, color=base_color)
plt.xlabel('Wealth segment')
plt.ylabel('Count')
print(counts * 100)

plt.title("Orders distribution by wealth segment - percent")
plt.show()
wealth_segment
Mass Customer        49.713824
High Net Worth       25.447608
Affluent Customer    24.838568
Name: proportion, dtype: float64
No description has been provided for this image
In [ ]:
# Orders distribution by state
plt.figure(figsize = (8,4), dpi = 100)
counts = MasterClean['state'].value_counts(normalize = True)
sns.barplot(x = counts.index, y = counts.values, color=base_color)
plt.xlabel('state')
plt.ylabel('Count')
print(counts * 100)

plt.title("Orders distribution by state - percent")
plt.show()
state
NSW    52.839742
VIC    25.036689
QLD    22.050191
ACT     0.073378
Name: proportion, dtype: float64
No description has been provided for this image
In [ ]:
import matplotlib.pyplot as plt

plt.figure(figsize=(7, 7), dpi=100)

counts = MasterClean['gender'].value_counts(normalize=True)
labels = counts.index
sizes = counts.values
colors = ['lightcoral', 'lightblue']  # You can customize the colors here

plt.pie(sizes, labels=labels, colors=colors, autopct='%1.1f%%', startangle=90)
plt.axis('equal')  # Equal aspect ratio ensures the pie chart is circular.

plt.title("Distribution of Customer Orders by Gender")
plt.show()
No description has been provided for this image
In [ ]:
import matplotlib.pyplot as plt
import seaborn as sns


plt.figure(figsize=(12, 8))

# Plotting the distribution of orders by job industry category
plt.subplot(1, 2, 1)  # 1 row, 2 columns, first plot
counts_industry = MasterClean['job_industry_category'].value_counts(normalize=True)
sns.barplot(x=counts_industry.index, y=counts_industry.values, color='b')
plt.xlabel('Job Industry Category')
plt.ylabel('Percentage')
plt.xticks(rotation=90)  # Rotate x-axis labels to 90 degrees
plt.title("Distribution of Orders by Job Industry Category")

# Plotting the distribution of top 10 job titles
plt.subplot(1, 2, 2)  # 1 row, 2 columns, second plot
top_10_titles = MasterClean['job_title'].value_counts(normalize=True).head(10)
sns.barplot(x=top_10_titles.index, y=top_10_titles.values, color='b')
plt.xlabel('Job Title')
plt.ylabel('Percentage')
plt.xticks(rotation=90)  # Rotate x-axis labels to 90 degrees
plt.title("Distribution of Top 10 Job Titles")

plt.tight_layout()  # To prevent overlapping labels and titles
plt.show()
No description has been provided for this image

Bivariate Analysis¶

In this section we will investigate pair of variables and check if there is any relationship between them.

Highlights from the Bivariate Analysis¶

  • More of women who own a car buy bikes, and more of men who do not own a car buy bikes
  • NSW has the highest number of bike buyers and more Female buyers than male buyers
  • More women buy bikes online than men
  • Most profitable brand is WeareA2B

These are usually best left to dashboards since it requires too much code to plot in pandas.

In [ ]:
# Brand by Profit and Standard Cost

import matplotlib.pyplot as plt
import seaborn as sns


plt.figure(figsize=(12, 6))

# First subplot - Profit by Brand
plt.subplot(1, 2, 1)
sns.barplot(data=MasterClean, x="brand", y='profit', color=sns.color_palette()[0])
plt.xlabel('Brand')
plt.ylabel('Profit (in USD)')
plt.xticks(rotation=90)
plt.title("Brand by Profit")

# Second subplot - Standard Cost by Brand
plt.subplot(1, 2, 2)
sns.barplot(data=MasterClean, x="brand", y='standard_cost', color=sns.color_palette()[1])
plt.xlabel('Brand')
plt.ylabel('Standard Cost')
plt.xticks(rotation=90)
plt.title("Brand by Standard Cost")

plt.tight_layout()
plt.show()
No description has been provided for this image
In [ ]:
#Gender vs owns a car

plt.figure(figsize = (7,4), dpi = 100)

sns.countplot(data = MasterClean, x = "owns_car",hue='gender')
plt.xlabel('owns a car');
plt.ylabel('count')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.title("Gender vs owns a car")

plt.show()
No description has been provided for this image
In [ ]:
#State vs Gender

plt.figure(figsize = (8,4), dpi = 100)

sns.countplot(data = MasterClean, x = "state",hue='gender')
plt.xlabel('State');
plt.ylabel('Count')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.title("State VS Gender")

plt.show()
No description has been provided for this image
In [ ]:
#State vs Gender

plt.figure(figsize = (8,4), dpi = 100)

sns.countplot(data = MasterClean, x = "online_order",hue='gender')
plt.xlabel('Online Order');
plt.ylabel('Count')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.title("Online Order VS Gender")

plt.show()
No description has been provided for this image
In [ ]:
# Wealth Segment Percentage and Profits

# Create a figure with two subplots side by side
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6), dpi=100)

# Calculate the order count and profit per wealth segment
counts = MasterClean['wealth_segment'].value_counts()
profits = MasterClean.groupby('wealth_segment')['profit'].sum()

# Normalize the counts to percentages
counts_percent = (counts / counts.sum()) * 100

# Create a bar plot for order distribution
sns.barplot(x=counts_percent.index, y=counts.values, color='lightblue', ax=ax1)
ax1.set_xlabel('Wealth Segment')
ax1.set_ylabel('Count')
ax1.set_title("Order Distribution by Wealth Segment")

# Add profit labels to the bars in the order distribution plot
for i, p in enumerate(ax1.patches):
    ax1.annotate(f'${profits[i]:.2f}', (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='center', fontsize=10, color='black', xytext=(0, 10),
                textcoords='offset points')

# Create a bar plot for profit distribution
sns.barplot(x=profits.index, y=profits.values, color='lightgreen', ax=ax2)
ax2.set_xlabel('Wealth Segment')
ax2.set_ylabel('Profit')
ax2.set_title("Profit Distribution by Wealth Segment")

# Add profit labels to the bars in the profit distribution plot
for i, p in enumerate(ax2.patches):
    ax2.annotate(f'${profits[i]:.2f}', (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='center', fontsize=10, color='black', xytext=(0, 10),
                textcoords='offset points')

# Adjust spacing between subplots
plt.tight_layout()

plt.show()
/tmp/ipykernel_491/1747977723.py:21: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  ax1.annotate(f'${profits[i]:.2f}', (p.get_x() + p.get_width() / 2., p.get_height()),
/tmp/ipykernel_491/1747977723.py:33: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  ax2.annotate(f'${profits[i]:.2f}', (p.get_x() + p.get_width() / 2., p.get_height()),
No description has been provided for this image

Multivariate Analysis¶

In this section we will use correlation matrix to find the correlation between the variables. Nothing of what we already don't know.

In [ ]:
# Drop non-numeric columns or encode them properly
numeric_columns = MasterClean.select_dtypes(include=['number'])

# Calculate the correlation matrix
correlation_matrix = numeric_columns.corr()

# Create the correlation matrix heatmap
plt.figure(figsize=(8,6), dpi=100)
sns.heatmap(correlation_matrix, cmap="viridis", annot=True)
plt.title("Correlation Matrix")
plt.xticks(rotation=90)
plt.show()
No description has been provided for this image

Task 2 : RFM Analysis¶

RFM stands for Recency, Frequency and Monetary Value, it is a customer segmentation technique that uses past purchase behavior to divide customers into groups. This will include the following steps:

  • Calculating the recency column i.e the number of days since the last purchase
  • Calculating the frequency column i.e the number of purchases made by a unique customer
  • Calculating the monetary column i.e the total amount of money spent by a unique customer
  • Making the RFM dataframe by merging the three columns

Usually we would calculate an RFM score which is also fairly easy to do, but we will be using K-Means clustering to make customer segments instead. We will be using standard scaler and elbow method to find the optimal number of clusters.

Creating the RFM Table¶

  • Since we already have the recency, frequency and profit columns, we can make the RFM table by merging them.
  • We will later map them to the master dataframe
In [ ]:
# Generating the RFM Table
RfmTable = Postcoded.groupby("customer_id").agg(
    {
        "recency": lambda x: x.min(),
        "frequency": lambda x: x.count(),
        "profit": lambda x: x.sum(),
    }
)

RfmTable.rename(
    columns={
        "recency": "recency",
        "frequency": "frequency",
        "profit": "monetary",
    },
    inplace=True,
)

RfmTable.head()
Out[ ]:
recency frequency monetary
customer_id
1 9 11 3018.09
2 130 3 2226.26
4 197 2 220.57
5 18 6 2394.94
6 66 5 3946.55

Finding the Optimal Number of Clusters¶

  • We wrote a function to scale the features (using standard scaler)
  • We wrote a function to fit the KMeans algorithm on the scaled features and return the model as well as the predictions
  • Make sure to run this only on the clean rfm dataframe

The elbow methods shows that the optimal number of clusters is 3, so we will classify each customer into one of these clusters.

In [ ]:
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

def plot_elbow_method(df):
    scaler = StandardScaler()
    scaled = scaler.fit_transform(df)

    inertia = []
    for i in np.arange(1, 11):
        kmeans = KMeans(n_clusters=i, n_init=10, random_state=1) # n_init is the number of times the K-mean algorithm will run with different centroid seeds (default=10), explicityly set to 10 to suppress warning
        kmeans.fit(scaled)
        inertia.append(kmeans.inertia_)

    plt.plot(np.arange(1, 11), inertia, marker='o')
    plt.xlabel('Number of Clusters')
    plt.ylabel('Inertia')
    plt.title('Elbow Method for Optimal k')
    plt.show()
    
    return scaled

# Example usage:
# Assuming you have a DataFrame called 'rfm_table', you can call the function like this:
# plot_elbow_method(rfm_table)
In [ ]:
scaled_array = plot_elbow_method(RfmTable)
No description has been provided for this image

Classifying the Customers¶

Now that we have the clusters, we will analyse the clusters and see which cluster has the highest profit, and then we will map the customer IDs to the clusters and then predict the high value customers.

  • Gold customers are comparatively less in number but have the highest profit
  • Silver customers are the most in number but struggle to keep up with the gold customers
  • Bronze customers are the least in number, have the best profit to customer ratio but have the lowest overall profit
  • Also Bronze customers very rarely buy any products
In [ ]:
# Use k-means to cluster the RFM table

kmeans = KMeans(n_clusters=3, n_init=10, random_state=10)
kmeans.fit(scaled_array)

# +1 to make the clusters start from 1 instead of 0
RfmTable['cluster'] = kmeans.labels_+1
RfmTable.head()
Out[ ]:
recency frequency monetary cluster
customer_id
1 9 11 3018.09 2
2 130 3 2226.26 3
4 197 2 220.57 3
5 18 6 2394.94 1
6 66 5 3946.55 1
In [ ]:
# Using a ClusterMap to group clusters by their average RFM values

ClusterMap = RfmTable.groupby("cluster").agg(
    {
        "recency": "mean",
        "frequency": "mean",
        "monetary": "mean",
    }
)
ClusterMap["cluster_size"] = RfmTable.groupby("cluster")["cluster"].count()

# Sorting the ClusterMap by monetary_value in descending order
SortedClusterMap = ClusterMap.sort_values(by="monetary", ascending=False)

# Preparing the customer_category column
# customer_category = ["Platinum","Gold", "Silver", "Bronze"]
customer_category = ["Gold", "Silver", "Bronze"]
SortedClusterMap["customer_category"] = customer_category

CategoryCluster = SortedClusterMap["customer_category"].to_dict()

SortedClusterMap
Out[ ]:
recency frequency monetary cluster_size customer_category
cluster
2 40.156137 8.150722 5023.386372 1108 Gold
1 43.959677 4.846198 2312.542056 1736 Silver
3 166.327465 3.642606 1871.216673 568 Bronze

Visualizing the Clusters¶

  • We will make an interactive cluster plot using plotly and see if we can find any insights.
  • We will also use seaborn to make a cluster plot.
In [ ]:
# Using Plotly to plot the ClusterMap

import plotly.express as px

# Create an interactive 3D scatter plot
fig = px.scatter_3d(
    RfmTable,
    x='recency',
    y='frequency',
    z='monetary',
    color='cluster',
    size_max=10,
    opacity=0.7,
    labels={'recency': 'Recency', 'frequency': 'Frequency', 'monetary': 'Monetary'},
    title='RFM Clusters - Interactive 3D Scatter Plot'
)

fig.show()
In [ ]:
# Using SeaBorn to plot the ClusterMap

import seaborn as sns
import matplotlib.pyplot as plt

# Create a joint plot for 'recency' and 'monetary'
sns.set(style="white")
fig1 = sns.jointplot(
    data=RfmTable,
    x='recency',
    y='monetary',
    hue='cluster',
    alpha=0.7,
    markers=["o", "s", "D"],
    height=7
)
fig1.set_axis_labels('Recency', 'Monetary')
fig1.fig.suptitle('RFM Clusters - Recency vs Monetary', y=1.02)

# Create a joint plot for 'frequency' and 'monetary'
fig2 = sns.jointplot(
    data=RfmTable,
    x='frequency',
    y='monetary',
    hue='cluster',
    alpha=0.7,
    markers=["o", "s", "D"],
    height=7
)
fig2.set_axis_labels('Frequency', 'Monetary')
fig2.fig.suptitle('RFM Clusters - Frequency vs Monetary', y=1.02)

# Display the subplots
plt.show()
No description has been provided for this image
No description has been provided for this image

Mapping the Clusters to the Master Dataframe¶

  • This will add two new columns to the master dataframe, cluster and customer_category
In [ ]:
# Mapping the clusters to the customers
Postcoded['cluster'] = Postcoded['customer_id'].map(RfmTable['cluster'])
Postcoded['customer_category'] = Postcoded['cluster'].map(CategoryCluster)
In [ ]:
Postcoded.sample(5)
Out[ ]:
transaction_id product_id customer_id transaction_date online_order order_status brand product_line product_class product_size list_price standard_cost product_first_sold_date first_name last_name gender past_3_years_bike_related_purchases DOB job_title job_industry_category wealth_segment deceased_indicator owns_car tenure address postcode country property_valuation lat long state customer_age product_age age_group profit recency frequency transaction_day transaction_week transaction_month cluster customer_category
13170 14831 15 2618 2017-03-31 False True Norco Bicycles Standard low medium 958.74 748.90 1999-12-19 09:34:07.879592 Jay Prue M 85 1977-12-14 Administrative Assistant IV NaN Mass Customer False False 4 2271 Fordem Plaza 2780 Australia 8 -33.832923 150.435639 NSW 40 18.0 31-50 209.84 276 10 Friday 13 March 2 Gold
2078 12054 76 1790 2017-11-28 True True WeareA2B Standard low medium 642.31 513.85 2008-10-21 09:34:07.879592 Vite Molyneux M 90 1993-09-25 Budget/Accounting Analyst IV Health Affluent Customer False False 1 4092 Brentwood Trail 2147 Australia 10 -33.766075 150.935836 NSW 24 9.0 14-30 128.46 34 5 Tuesday 48 November 1 Silver
5730 12667 89 1594 2017-12-30 True True WeareA2B Touring medium large 1362.99 57.74 2010-07-21 09:34:07.879592 Rolando Amyes M 62 1981-11-26 Human Resources Manager Property Mass Customer False False 7 0 Thackeray Avenue 2100 Australia 10 -33.766409 151.257781 NSW 36 7.0 31-50 1305.25 2 6 Saturday 52 December 2 Gold
5804 16817 31 673 2017-07-02 False True Giant Bicycles Standard medium medium 230.91 173.18 1989-12-30 09:34:07.879592 Artie Ommanney M 35 1975-06-27 VP Sales Health Mass Customer False True 4 2836 Morning Drive 2304 Australia 7 -32.880319 151.738620 NSW 42 28.0 31-50 57.73 183 5 Sunday 26 July 1 Silver
3780 100 0 119 2017-07-31 False True Solex Standard medium medium 478.16 298.72 1992-12-28 09:34:07.879592 Willey Chastanet M 9 1981-12-04 Associate Professor NaN High Net Worth False True 9 5814 Bunker Hill Street 3070 Australia 7 -37.774064 144.999745 VIC 36 25.0 31-50 179.44 154 4 Monday 31 July 1 Silver

Task 2 : Geospatial Analysis¶

In this section we will only focus on geopandas and try to make customer segmentation maps.

  • We will also try to geocode the customer adresses to get another lat and long column geo_lat and geo_long
  • Geocoding will take a lot of time so we'll just work the with the coordinates we already have

Seems like our most sales are in coastal areas, with that lovely view guess I'll buy a bike too.

  • We will make a new column distance_from_coast which will be the distance of the customer from the coast
  • NSW has the most sales, and the most sales are in the coastal areas of NSW
  • Followed by Victoria and then Queensland
In [ ]:
# Pepraring the GeoCustomers dataframe
# We will use this to draw interactive maps

GeoCustomers = Postcoded.copy()

# Use groupby to get unique customer_id 
GeoCustomers = GeoCustomers.groupby('customer_id').first().reset_index()
GeoCustomers = GeoCustomers[['customer_id', 'lat', 'long', 'state', 'cluster', 'customer_category']]

GeoCustomers.head()
Out[ ]:
customer_id lat long state cluster customer_category
0 1 -33.894912 151.206211 NSW 2 Gold
1 2 -33.731651 150.955942 NSW 3 Bronze
2 4 -28.035453 153.241258 QLD 3 Bronze
3 5 -30.604667 152.956681 NSW 1 Silver
4 6 -38.215906 144.334005 VIC 1 Silver

Marker Cluster¶

In [ ]:
# Importing Folium and Geopandas
import geopandas as gpd
import folium

# Importing folium plugins
from folium.plugins import MarkerCluster, HeatMap

# Creating a simple map
m_1 = folium.Map(location=[-25.2744, 133.7751], tiles='cartodbpositron', zoom_start=4)

# Adding transaction points to the map using MarkerCluster
mc = MarkerCluster()
for idx, row in GeoCustomers.iterrows():
    if not np.isnan(row['lat']) and not np.isnan(row['long']):
        mc.add_child(folium.Marker([row['lat'], row['long']]))
        
m_1.add_child(mc)

# Display the map
m_1
Out[ ]:
Make this Notebook Trusted to load map: File -> Trust Notebook

Heatmap¶

In [ ]:
# Drawing a heatmap of the transaction points
m_2 = folium.Map(location=[-25.2744, 133.7751], tiles='cartodbpositron', zoom_start=4)

# Adding transaction points to the map using HeatMap
HeatMap(data=GeoCustomers[['lat', 'long']], radius=15).add_to(m_2)

# Display the map
m_2
Out[ ]:
Make this Notebook Trusted to load map: File -> Trust Notebook

Bubble Map¶

In [ ]:
# Draw a bubble map to visualize the customer catergories gold, silver and bronze

# Create a base map
m_3 = folium.Map(location=[-25.2744, 133.7751], tiles='cartodbpositron', zoom_start=4)

# Add a bubble map to the base map
for idx, row in GeoCustomers.iterrows():
    if not np.isnan(row['lat']) and not np.isnan(row['long']):
        if row['customer_category'] == 'Gold':
            folium.CircleMarker([row['lat'], row['long']], radius=5, color='yellow', fill=True).add_to(m_3)
        elif row['customer_category'] == 'Silver':
            folium.CircleMarker([row['lat'], row['long']], radius=5, color='blue', fill=True).add_to(m_3)
        elif row['customer_category'] == 'Bronze':
            folium.CircleMarker([row['lat'], row['long']], radius=5, color='brown', fill=True).add_to(m_3)
            
# Display the map
m_3
            
Out[ ]:
Make this Notebook Trusted to load map: File -> Trust Notebook

Analysing Coastline Proximity¶

Refer to this notebook for the code.

In [ ]:
GeoCustomers
Out[ ]:
customer_id lat long state cluster customer_category
0 1 -33.894912 151.206211 NSW 2 Gold
1 2 -33.731651 150.955942 NSW 3 Bronze
2 4 -28.035453 153.241258 QLD 3 Bronze
3 5 -30.604667 152.956681 NSW 1 Silver
4 6 -38.215906 144.334005 VIC 1 Silver
... ... ... ... ... ... ...
3407 3496 -33.937716 150.848737 NSW 3 Bronze
3408 3497 -38.043995 145.264296 VIC 1 Silver
3409 3498 -37.807135 144.861162 VIC 3 Bronze
3410 3499 -27.549179 152.951385 QLD 2 Gold
3411 3500 -33.766409 151.257781 NSW 3 Bronze

3412 rows × 6 columns

In [ ]:
from shapely.ops import nearest_points, Point
import haversine as hs
from haversine import Unit

def coastline_minima(df, lat_col, long_col, country):
    """
    This function takes in a dataframe with latitude and longitude columns and returns the nearest point on the coastline and the distance in km

    Required libraries: geopandas, shapely, folium, haversine
    Returns: original dataframe with two new columns: nearest_point and distance_to_coast

    """
    # Getting the coastline
    world = gpd.read_file(gpd.datasets.get_path("naturalearth_lowres"))
    coastline = world[world.name.str.contains(country)].boundary

    # Creating empty lists to store the nearest point and the distance
    nearest_points_list = []
    distance_list = []

    for i in range(len(df)):
        geo = Point(df[long_col][i], df[lat_col][i])
        nearest = nearest_points(geo, coastline)[1]
        cos = Point(nearest.x, nearest.y)
        loc1 = (geo.y, geo.x)
        loc2 = (cos.y, cos.x)
        result = hs.haversine(loc1, loc2, unit=Unit.KILOMETERS)

        nearest_points_list.append(nearest)
        distance_list.append(result)

    # Adding the new columns to the dataframe
    df["nearest_point"] = nearest_points_list
    df["distance_to_coast"] = distance_list

    return df
In [ ]:
coastline_minima(GeoCustomers, 'lat', 'long', 'Australia')
GeoCustomers.head()
/tmp/ipykernel_491/1020208499.py:14: FutureWarning:

The geopandas.dataset module is deprecated and will be removed in GeoPandas 1.0. You can get the original 'naturalearth_lowres' data from https://www.naturalearthdata.com/downloads/110m-cultural-vectors/.

Out[ ]:
customer_id lat long state cluster customer_category nearest_point distance_to_coast
0 1 -33.894912 151.206211 NSW 2 Gold 137 POINT (151.26433 -33.93411) dtype: geom... 6.910553
1 2 -33.731651 150.955942 NSW 3 Bronze 137 POINT (151.26176 -33.93792) dtype: geom... 36.385439
2 4 -28.035453 153.241258 QLD 3 Bronze 137 POINT (153.45911 -27.91328) dtype: geom... 25.342373
3 5 -30.604667 152.956681 NSW 1 Silver 137 POINT (153.07812 -30.60035) dtype: geom... 11.632766
4 6 -38.215906 144.334005 VIC 1 Silver 137 POINT (144.33147 -38.21284) dtype: geom... 0.406144
In [ ]:
# Merging Geocustomers dataframe with the Postcoded dataframe

Postcoded = pd.merge(
    Postcoded,
    GeoCustomers[['customer_id', 'nearest_point', 'distance_to_coast']],
    how="inner",
    left_on="customer_id",
    right_on="customer_id",
)
In [ ]:
Postcoded.sample(5)
Out[ ]:
transaction_id product_id customer_id transaction_date online_order order_status brand product_line product_class product_size list_price standard_cost product_first_sold_date first_name last_name gender past_3_years_bike_related_purchases DOB job_title job_industry_category wealth_segment deceased_indicator owns_car tenure address postcode country property_valuation lat long state customer_age product_age age_group profit recency frequency transaction_day transaction_week transaction_month cluster customer_category nearest_point distance_to_coast
15259 10956 50 710 2017-11-21 False True Giant Bicycles Standard medium medium 642.70 211.37 1996-04-02 09:34:07.879592 Benton Oakly M 76 1970-03-01 Sales Representative Retail High Net Worth False True 13 1721 Nobel Junction 2753 Australia 8 -33.604022 150.646053 NSW 47 21.0 31-50 431.33 41 7 Tuesday 47 November 2 Gold 137 POINT (151.22403 -33.99385) dtype: geom... 68.783845
8448 2666 28 298 2017-03-07 False True Solex Road medium small 1703.52 1516.13 2005-04-27 09:34:07.879592 Maurise Lenglet M 69 1956-08-15 Engineer I Manufacturing High Net Worth False False 9 83 Brickson Park Drive 3029 Australia 8 -37.837165 144.705831 VIC 61 12.0 51-70 187.39 300 11 Tuesday 10 March 2 Gold 137 POINT (144.75898 -37.99074) dtype: geom... 17.702334
17705 19788 92 2474 2017-08-09 False True WeareA2B Touring medium large 1890.39 260.14 1992-12-28 09:34:07.879592 Chrysa Burrett F 59 1980-06-08 Executive Secretary Argiculture High Net Worth False True 6 70 Erie Street 4814 Australia 2 -19.285821 146.754745 QLD 37 25.0 31-50 1630.25 145 8 Wednesday 32 August 2 Gold 137 POINT (146.81361 -19.16373) dtype: geom... 14.916644
1252 11862 76 867 2017-01-31 True True WeareA2B Standard low medium 642.31 513.85 2008-10-21 09:34:07.879592 Lura MacKim F 12 1973-05-10 VP Marketing NaN Mass Customer False False 17 8360 Washington Avenue 3124 Australia 12 -37.840867 145.068191 VIC 44 9.0 31-50 128.46 335 6 Tuesday 05 January 3 Bronze 137 POINT (145.03221 -37.89619) dtype: geom... 6.914766
7009 7460 16 1234 2017-02-12 NaN True Norco Bicycles Standard high small 1661.92 1479.11 1988-09-20 09:34:07.879592 Rancell Yven M 87 1965-04-19 Geologist I NaN Mass Customer False False 17 3765 Mandrake Alley 2763 Australia 9 -33.725619 150.895470 NSW 52 29.0 51-70 182.81 323 8 Sunday 06 February 1 Silver 137 POINT (151.24565 -33.96180) dtype: geom... 41.660704

Task 3 : Model Development¶

Final task of this project, we will use all the data and features we built earlier and make a machine learning model to predict which customer segment the new customers will fall into.

We will build two models

  • a simple logistic regression model
  • a HistGradientBoostingClassifier model

We will also use RandomSearchCV to find the best parameters for the models.

Data Preprocessing¶

Lets see what we are working with

  • We have Postcoded as the Training Data
  • We have NewCustomerList as the Testing Data

We need to add features to the testing data as well.

In [ ]:
Train = Postcoded.copy()
Train.columns
Out[ ]:
Index(['transaction_id', 'product_id', 'customer_id', 'transaction_date',
       'online_order', 'order_status', 'brand', 'product_line',
       'product_class', 'product_size', 'list_price', 'standard_cost',
       'product_first_sold_date', 'first_name', 'last_name', 'gender',
       'past_3_years_bike_related_purchases', 'DOB', 'job_title',
       'job_industry_category', 'wealth_segment', 'deceased_indicator',
       'owns_car', 'tenure', 'address', 'postcode', 'country',
       'property_valuation', 'lat', 'long', 'state', 'customer_age',
       'product_age', 'age_group', 'profit', 'recency', 'frequency',
       'transaction_day', 'transaction_week', 'transaction_month', 'cluster',
       'customer_category', 'nearest_point', 'distance_to_coast'],
      dtype='object')
In [ ]:
Test = NewCustomerList.copy()
Test.columns
Out[ ]:
Index(['first_name', 'last_name', 'gender',
       'past_3_years_bike_related_purchases', 'DOB', 'job_title',
       'job_industry_category', 'wealth_segment', 'deceased_indicator',
       'owns_car', 'tenure', 'address', 'postcode', 'state', 'country',
       'property_valuation', 'Rank', 'Value'],
      dtype='object')

Feature Engineering¶

We'll be engineering the following features in the test data:

  • customer_age
  • age_group
  • lat, long
  • distance_to_coast

So basically we will only be expanding the DOB and postcode columns. Lets also try and multiply the property_valuation with the tenure, that might do something.

And we will add two more features based on the Property Valuation and Tenure to both the Training and Testing Data

  • tenure_valuation_mul
  • tenure_valuation_div

And also potential_customer_id to the testing data

Age and Age Group¶

  • customer_age
  • age_group
In [ ]:
# Customer Age Feature
Test["DOB"] = pd.to_datetime(Test["DOB"])
Test["customer_age"] = (end_date - Test["DOB"]).dt.days // 365.25
Test["customer_age"] = Test["customer_age"].astype("Int64")

# Keeping only the values between 14 and 100
Test = Test[Test["customer_age"].between(14, 100)]
In [ ]:
# Calculate the Age Group
age_group = pd.cut(
    Test["customer_age"],
    bins=[14,30,50,70,90],
    labels=["14-30","30-50","50-70","70-90"]
)

Test["age_group"] = age_group

Geo Location¶

  • lat
  • long
  • distance_to_coast
  • state
  • nearest_point
In [ ]:
# Adding the Coordinate Columns
Test = pd.merge(
    Test,
    Postcodes[['postcode', 'lat', 'long', 'state']],
    how="inner",
    left_on="postcode",
    right_on="postcode",
)

print(Test.shape)
print(Postcodes.shape)

# Dropping the state_x column
Test.drop(columns="state_x", inplace=True)
Test.rename(columns={"state_y": "state"}, inplace=True)

Test.sample(5)
(983, 23)
(3167, 4)
Out[ ]:
first_name last_name gender past_3_years_bike_related_purchases DOB job_title job_industry_category wealth_segment deceased_indicator owns_car tenure address postcode country property_valuation Rank Value customer_age age_group lat long state
444 Jared Fendlow M 79 1963-08-25 Media Manager II Financial Services Mass Customer False False 15 6195 Bellgrove Lane 4211 Australia 7 832 0.5750 54 50-70 -28.035453 153.241258 QLD
753 Georgi NaN M 29 1970-01-14 Assistant Manager Manufacturing High Net Worth False False 11 59 Garrison Terrace 3215 Australia 4 485 0.8755 47 30-50 -38.111002 144.334520 VIC
712 Frederigo Cribbott M 35 1965-03-15 Social Worker Health Mass Customer False False 16 42280 Namekagon Crossing 2140 Australia 8 754 0.6460 52 50-70 -33.835018 151.070853 NSW
279 Rebeca Aggas F 66 1953-02-27 Social Worker Health Affluent Customer False False 21 7026 Katie Lane 3818 Australia 1 127 1.2500 64 50-70 -38.096289 145.867337 VIC
250 Kearney Cuddehy M 88 1997-05-18 Marketing Assistant Retail Mass Customer False False 12 98 Shoshone Road 4207 Australia 6 634 0.7395 20 14-30 -27.733991 153.204523 QLD
In [ ]:
# Caculating the distance to coast
Test = coastline_minima(Test, 'lat', 'long', 'Australia')
/tmp/ipykernel_491/1020208499.py:14: FutureWarning:

The geopandas.dataset module is deprecated and will be removed in GeoPandas 1.0. You can get the original 'naturalearth_lowres' data from https://www.naturalearthdata.com/downloads/110m-cultural-vectors/.

Tenure and Property Valuation¶

  • tenure_valuation_mul
  • tenure_valuation_div
In [ ]:
# Multiplying and Dividing the tenure and property_valuation

Train["tenure_valuation_mul"] = Train["tenure"] * Train["property_valuation"]
Train["tenure_valuation_div"] = Train["tenure"] / Train["property_valuation"]

# Same for Test

Test["tenure_valuation_mul"] = Test["tenure"] * Test["property_valuation"]
Test["tenure_valuation_div"] = Test["tenure"] / Test["property_valuation"]

Potential Customer ID¶

  • potential_customer_id
In [ ]:
Test['potential_customer_id'] = Test.index + 1

Building a Basic Logistic Regression Model¶

  • We will copy the training data to data
  • Look out for leakage while building the model, drop all the columns that are not present in the testing data
  • Also drop the columns that can't be classified as numerical or categorical

Feature Selection¶

  • defining cols with all the features that are supposed to go into data
In [ ]:
# Getting all the valid columns for building the model

cols = [
    "gender",
    "past_3_years_bike_related_purchases",
    "job_title",
    "job_industry_category",
    "wealth_segment",
    "deceased_indicator",
    "owns_car",
    "tenure",
    "property_valuation",
    "lat",
    "long",
    "state",
    "customer_age",
    "age_group",
    "distance_to_coast",
    "tenure_valuation_mul",
    "tenure_valuation_div",
    "customer_category",
]

data = Train[cols]
In [ ]:
# Defining the target variable

target_name = "customer_category"
target = data[target_name]

data = data.drop(columns=target_name)

Defining Categorical and Numerical Columns¶

  • We will convert all the datatypes appropriately so that there's only either objects or numbers in the dataframe
  • We will use sklearns make_column_selector to select the columns
In [ ]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19513 entries, 0 to 19512
Data columns (total 17 columns):
 #   Column                               Non-Null Count  Dtype   
---  ------                               --------------  -----   
 0   gender                               19513 non-null  object  
 1   past_3_years_bike_related_purchases  19513 non-null  Int64   
 2   job_title                            17173 non-null  object  
 3   job_industry_category                16291 non-null  object  
 4   wealth_segment                       19513 non-null  object  
 5   deceased_indicator                   19513 non-null  object  
 6   owns_car                             19513 non-null  object  
 7   tenure                               19513 non-null  Int64   
 8   property_valuation                   19513 non-null  Int64   
 9   lat                                  19513 non-null  float64 
 10  long                                 19513 non-null  float64 
 11  state                                19513 non-null  object  
 12  customer_age                         19513 non-null  Int64   
 13  age_group                            19513 non-null  category
 14  distance_to_coast                    19513 non-null  float64 
 15  tenure_valuation_mul                 19513 non-null  Int64   
 16  tenure_valuation_div                 19513 non-null  Float64 
dtypes: Float64(1), Int64(5), category(1), float64(3), object(7)
memory usage: 2.5+ MB
In [ ]:
# Convert age_group to object
data["age_group"] = data["age_group"].astype("object")

# Convert bools to object
bool_cols = data.columns[data.dtypes.eq("bool")]
data[bool_cols] = data[bool_cols].astype("object")

data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19513 entries, 0 to 19512
Data columns (total 17 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   gender                               19513 non-null  object 
 1   past_3_years_bike_related_purchases  19513 non-null  Int64  
 2   job_title                            17173 non-null  object 
 3   job_industry_category                16291 non-null  object 
 4   wealth_segment                       19513 non-null  object 
 5   deceased_indicator                   19513 non-null  object 
 6   owns_car                             19513 non-null  object 
 7   tenure                               19513 non-null  Int64  
 8   property_valuation                   19513 non-null  Int64  
 9   lat                                  19513 non-null  float64
 10  long                                 19513 non-null  float64
 11  state                                19513 non-null  object 
 12  customer_age                         19513 non-null  Int64  
 13  age_group                            19513 non-null  object 
 14  distance_to_coast                    19513 non-null  float64
 15  tenure_valuation_mul                 19513 non-null  Int64  
 16  tenure_valuation_div                 19513 non-null  Float64
dtypes: Float64(1), Int64(5), float64(3), object(8)
memory usage: 2.6+ MB
In [ ]:
from sklearn.compose import make_column_selector as selector

numerical_columns_selector = selector(dtype_exclude=object)
category_columns_selector = selector(dtype_include=object)

numerical_columns = numerical_columns_selector(data)
category_columns = category_columns_selector(data)

# Print the lengths of numerical_columns, category_columns, and data.columns
print(f"Number of numerical columns: {len(numerical_columns)}")
print(f"Number of categorical columns: {len(category_columns)}")
print(f"Total number of columns: {len(data.columns)}")
Number of numerical columns: 9
Number of categorical columns: 8
Total number of columns: 17

Building the Pipeline¶

In [ ]:
# One-hot encoding the categorical columns and standardizing the numerical columns

from sklearn.preprocessing import OneHotEncoder, StandardScaler

categorical_preprocessor = OneHotEncoder(handle_unknown="ignore")
numerical_preprocessor = StandardScaler()
In [ ]:
# Creating the Transformer and preprocessor object
from sklearn.compose import ColumnTransformer

preprocessor = ColumnTransformer(
    [
        ("cat-preprocessor", categorical_preprocessor, category_columns),
        ("num-preprocessor", numerical_preprocessor, numerical_columns),
    ]
)
In [ ]:
# Building a model with a pipeline

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

model = make_pipeline(preprocessor, LogisticRegression(max_iter=1000))
model
Out[ ]:
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('cat-preprocessor',
                                                  OneHotEncoder(handle_unknown='ignore'),
                                                  ['gender', 'job_title',
                                                   'job_industry_category',
                                                   'wealth_segment',
                                                   'deceased_indicator',
                                                   'owns_car', 'state',
                                                   'age_group']),
                                                 ('num-preprocessor',
                                                  StandardScaler(),
                                                  ['past_3_years_bike_related_purchases',
                                                   'tenure',
                                                   'property_valuation', 'lat',
                                                   'long', 'customer_age',
                                                   'distance_to_coast',
                                                   'tenure_valuation_mul',
                                                   'tenure_valuation_div'])])),
                ('logisticregression', LogisticRegression(max_iter=1000))])
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('cat-preprocessor',
                                                  OneHotEncoder(handle_unknown='ignore'),
                                                  ['gender', 'job_title',
                                                   'job_industry_category',
                                                   'wealth_segment',
                                                   'deceased_indicator',
                                                   'owns_car', 'state',
                                                   'age_group']),
                                                 ('num-preprocessor',
                                                  StandardScaler(),
                                                  ['past_3_years_bike_related_purchases',
                                                   'tenure',
                                                   'property_valuation', 'lat',
                                                   'long', 'customer_age',
                                                   'distance_to_coast',
                                                   'tenure_valuation_mul',
                                                   'tenure_valuation_div'])])),
                ('logisticregression', LogisticRegression(max_iter=1000))])
ColumnTransformer(transformers=[('cat-preprocessor',
                                 OneHotEncoder(handle_unknown='ignore'),
                                 ['gender', 'job_title',
                                  'job_industry_category', 'wealth_segment',
                                  'deceased_indicator', 'owns_car', 'state',
                                  'age_group']),
                                ('num-preprocessor', StandardScaler(),
                                 ['past_3_years_bike_related_purchases',
                                  'tenure', 'property_valuation', 'lat', 'long',
                                  'customer_age', 'distance_to_coast',
                                  'tenure_valuation_mul',
                                  'tenure_valuation_div'])])
['gender', 'job_title', 'job_industry_category', 'wealth_segment', 'deceased_indicator', 'owns_car', 'state', 'age_group']
OneHotEncoder(handle_unknown='ignore')
['past_3_years_bike_related_purchases', 'tenure', 'property_valuation', 'lat', 'long', 'customer_age', 'distance_to_coast', 'tenure_valuation_mul', 'tenure_valuation_div']
StandardScaler()
LogisticRegression(max_iter=1000)
In [ ]:
# Making the train-test split

from sklearn.model_selection import train_test_split

data_train, data_test, target_train, target_test = train_test_split(
    data, target, test_size=0.2, random_state=42
)
In [ ]:
_ = model.fit(data_train, target_train)
In [ ]:
data_test.head()
Out[ ]:
gender past_3_years_bike_related_purchases job_title job_industry_category wealth_segment deceased_indicator owns_car tenure property_valuation lat long state customer_age age_group distance_to_coast tenure_valuation_mul tenure_valuation_div
9530 F 60 Administrative Officer Financial Services High Net Worth False False 12 7 -37.869072 144.999026 VIC 38 31-50 3.973526 84 1.714286
385 M 68 Financial Analyst Financial Services Mass Customer False False 3 10 -33.731523 150.931074 NSW 36 31-50 38.424585 30 0.3
15307 M 56 Editor Financial Services Mass Customer False False 5 10 -37.744406 145.082685 VIC 41 31-50 17.449963 50 0.5
4153 F 78 Quality Engineer Financial Services Affluent Customer False True 17 11 -33.732380 151.000948 NSW 43 31-50 32.667056 187 1.545455
8658 M 23 Librarian Entertainment Mass Customer False False 10 8 -33.783751 150.900547 NSW 48 31-50 38.026580 80 1.25

Scores and Predictions¶

  • Model Score 0.5318985395849347
In [ ]:
model.predict(data_test)[:5]
Out[ ]:
array(['Gold', 'Gold', 'Silver', 'Gold', 'Gold'], dtype=object)
In [ ]:
target_test[:5]
Out[ ]:
9530     Bronze
385        Gold
15307      Gold
4153       Gold
8658       Gold
Name: customer_category, dtype: object
In [ ]:
model.score(data_test, target_test)
Out[ ]:
0.5318985395849347
In [ ]:
# Evaluating the model with cross-validation
from sklearn.model_selection import cross_validate

cv_results = cross_validate(model, data, target, cv=5)
cv_results
Out[ ]:
{'fit_time': array([1.96170378, 2.41558146, 1.45675778, 1.40848422, 1.35900307]),
 'score_time': array([0.01325226, 0.02733541, 0.01228523, 0.01573133, 0.01378798]),
 'test_score': array([0.46682039, 0.45836536, 0.46169613, 0.47514095, 0.46258329])}
In [ ]:
# Getting the final score for our model

scores = cv_results["test_score"]

# Print mean cross-validation accuracy
print(f"Mean Accuracy: {scores.mean():.3f}")
print(f"Accuracy standard deviation: {scores.std():.3f}")
Mean Accuracy: 0.465
Accuracy standard deviation: 0.006

Building a Better Model¶

  • HistGradientBoostingClassifier is a better model than our naive logistic regression model
  • We will use RandomizedSearchCV to find the best parameters for the model

Apparently Scaling numerical features is indeed useless for most decision tree models in general and for HistGradientBoostingClassifier in particular. We get better results with the Ordinal Encoder than with the One Hot Encoder, and Scaling the numerical features does not make a difference.

Visit here for more information.

Meaningful order Non-meaningful order
Tree-based model OrdinalEncoder OrdinalEncoder
Linear model OrdinalEncoder with caution OneHotEncoder
  • OneHotEncoder : always does something meaningful, but can be unnecessarily slow with trees.
  • OrdinalEncoder : can be detrimental for linear models unless your category has a meaningful order and you make sure that OrdinalEncoder respects this order. Trees can deal with OrdinalEncoder fine as long as they are deep enough.

Hence we will use the Ordinal Encoder for the categorical features, for our use case since it is giving us the best score.

Bulding the Pipeline¶

  • We will build a pipeline with Ordinal Encoder and HistGradientBoostingClassifier
In [ ]:
from sklearn.model_selection import train_test_split

data_train, data_test, target_train, target_test = train_test_split(
    data, target, random_state=42
)
In [ ]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder
from sklearn.compose import make_column_selector as selector

categorical_columns_selector = selector(dtype_include=object)
categorical_columns = categorical_columns_selector(data)

categorical_preprocessor = OrdinalEncoder(
    handle_unknown="use_encoded_value", unknown_value=-1
)
preprocessor = ColumnTransformer(
    [("cat_preprocessor", categorical_preprocessor, categorical_columns)],
    remainder="passthrough",
)
In [ ]:
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.pipeline import Pipeline

model = Pipeline(
    [
        ("preprocessor", preprocessor),
        (
            "classifier",
            HistGradientBoostingClassifier(random_state=42, max_leaf_nodes=4),
        ),
    ]
)

model
Out[ ]:
Pipeline(steps=[('preprocessor',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('cat_preprocessor',
                                                  OrdinalEncoder(handle_unknown='use_encoded_value',
                                                                 unknown_value=-1),
                                                  ['gender', 'job_title',
                                                   'job_industry_category',
                                                   'wealth_segment',
                                                   'deceased_indicator',
                                                   'owns_car', 'state',
                                                   'age_group'])])),
                ('classifier',
                 HistGradientBoostingClassifier(max_leaf_nodes=4,
                                                random_state=42))])
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
Pipeline(steps=[('preprocessor',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('cat_preprocessor',
                                                  OrdinalEncoder(handle_unknown='use_encoded_value',
                                                                 unknown_value=-1),
                                                  ['gender', 'job_title',
                                                   'job_industry_category',
                                                   'wealth_segment',
                                                   'deceased_indicator',
                                                   'owns_car', 'state',
                                                   'age_group'])])),
                ('classifier',
                 HistGradientBoostingClassifier(max_leaf_nodes=4,
                                                random_state=42))])
ColumnTransformer(remainder='passthrough',
                  transformers=[('cat_preprocessor',
                                 OrdinalEncoder(handle_unknown='use_encoded_value',
                                                unknown_value=-1),
                                 ['gender', 'job_title',
                                  'job_industry_category', 'wealth_segment',
                                  'deceased_indicator', 'owns_car', 'state',
                                  'age_group'])])
['gender', 'job_title', 'job_industry_category', 'wealth_segment', 'deceased_indicator', 'owns_car', 'state', 'age_group']
OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
passthrough
HistGradientBoostingClassifier(max_leaf_nodes=4, random_state=42)

Applying RandomizedSearchCV on the Pipeline¶

  • Code referenced form INRIA MOOC
The best parameters are:
{'classifier__l2_regularization': 0.022391943681217833,
 'classifier__learning_rate': 0.006270138607046264,
 'classifier__max_bins': 4,
 'classifier__max_leaf_nodes': 142,
 'classifier__min_samples_leaf': 31}
In [ ]:
from scipy.stats import loguniform


class loguniform_int:
    """Integer valued version of the log-uniform distribution"""

    def __init__(self, a, b):
        self._distribution = loguniform(a, b)

    def rvs(self, *args, **kwargs):
        """Random variable sample"""
        return self._distribution.rvs(*args, **kwargs).astype(int)
In [ ]:
%%time
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    "classifier__l2_regularization": loguniform(1e-6, 1e3),
    "classifier__learning_rate": loguniform(0.001, 10),
    "classifier__max_leaf_nodes": loguniform_int(2, 256),
    "classifier__min_samples_leaf": loguniform_int(1, 100),
    "classifier__max_bins": loguniform_int(2, 255),
}

model_random_search = RandomizedSearchCV(
    model,
    param_distributions=param_distributions,
    n_iter=5,
    cv=5,
    verbose=1,
    n_jobs=-1,
)
model_random_search.fit(data_train, target_train)
Fitting 5 folds for each of 5 candidates, totalling 25 fits
CPU times: user 25.6 s, sys: 580 ms, total: 26.1 s
Wall time: 17.6 s
Out[ ]:
RandomizedSearchCV(cv=5,
                   estimator=Pipeline(steps=[('preprocessor',
                                              ColumnTransformer(remainder='passthrough',
                                                                transformers=[('cat_preprocessor',
                                                                               OrdinalEncoder(handle_unknown='use_encoded_value',
                                                                                              unknown_value=-1),
                                                                               ['gender',
                                                                                'job_title',
                                                                                'job_industry_category',
                                                                                'wealth_segment',
                                                                                'deceased_indicator',
                                                                                'owns_car',
                                                                                'state',
                                                                                'age_group'])])),
                                             ('classifier',
                                              HistGr...
                   param_distributions={'classifier__l2_regularization': <scipy.stats._distn_infrastructure.rv_continuous_frozen object at 0x7f920c9ea2f0>,
                                        'classifier__learning_rate': <scipy.stats._distn_infrastructure.rv_continuous_frozen object at 0x7f91a5a77880>,
                                        'classifier__max_bins': <__main__.loguniform_int object at 0x7f91a5a9c340>,
                                        'classifier__max_leaf_nodes': <__main__.loguniform_int object at 0x7f91a5a76aa0>,
                                        'classifier__min_samples_leaf': <__main__.loguniform_int object at 0x7f91a5a77580>},
                   verbose=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
RandomizedSearchCV(cv=5,
                   estimator=Pipeline(steps=[('preprocessor',
                                              ColumnTransformer(remainder='passthrough',
                                                                transformers=[('cat_preprocessor',
                                                                               OrdinalEncoder(handle_unknown='use_encoded_value',
                                                                                              unknown_value=-1),
                                                                               ['gender',
                                                                                'job_title',
                                                                                'job_industry_category',
                                                                                'wealth_segment',
                                                                                'deceased_indicator',
                                                                                'owns_car',
                                                                                'state',
                                                                                'age_group'])])),
                                             ('classifier',
                                              HistGr...
                   param_distributions={'classifier__l2_regularization': <scipy.stats._distn_infrastructure.rv_continuous_frozen object at 0x7f920c9ea2f0>,
                                        'classifier__learning_rate': <scipy.stats._distn_infrastructure.rv_continuous_frozen object at 0x7f91a5a77880>,
                                        'classifier__max_bins': <__main__.loguniform_int object at 0x7f91a5a9c340>,
                                        'classifier__max_leaf_nodes': <__main__.loguniform_int object at 0x7f91a5a76aa0>,
                                        'classifier__min_samples_leaf': <__main__.loguniform_int object at 0x7f91a5a77580>},
                   verbose=1)
Pipeline(steps=[('preprocessor',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('cat_preprocessor',
                                                  OrdinalEncoder(handle_unknown='use_encoded_value',
                                                                 unknown_value=-1),
                                                  ['gender', 'job_title',
                                                   'job_industry_category',
                                                   'wealth_segment',
                                                   'deceased_indicator',
                                                   'owns_car', 'state',
                                                   'age_group'])])),
                ('classifier',
                 HistGradientBoostingClassifier(max_leaf_nodes=4,
                                                random_state=42))])
ColumnTransformer(remainder='passthrough',
                  transformers=[('cat_preprocessor',
                                 OrdinalEncoder(handle_unknown='use_encoded_value',
                                                unknown_value=-1),
                                 ['gender', 'job_title',
                                  'job_industry_category', 'wealth_segment',
                                  'deceased_indicator', 'owns_car', 'state',
                                  'age_group'])])
['gender', 'job_title', 'job_industry_category', 'wealth_segment', 'deceased_indicator', 'owns_car', 'state', 'age_group']
OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
passthrough
HistGradientBoostingClassifier(max_leaf_nodes=4, random_state=42)

Checking Accuracy of the Model¶

In [ ]:
accuracy = model_random_search.score(data_test, target_test)

print(f"The test accuracy score of the best model is {accuracy:.2f}")
The test accuracy score of the best model is 0.99

In [ ]:
from pprint import pprint

print("The best parameters are:")
pprint(model_random_search.best_params_)
The best parameters are:
{'classifier__l2_regularization': 0.00019812287323236764,
 'classifier__learning_rate': 0.1990528942243387,
 'classifier__max_bins': 12,
 'classifier__max_leaf_nodes': 165,
 'classifier__min_samples_leaf': 93}
In [ ]:
# get the parameter names
column_results = [f"param_{name}" for name in param_distributions.keys()]
column_results += ["mean_test_score", "std_test_score", "rank_test_score"]

cv_results = pd.DataFrame(model_random_search.cv_results_)
cv_results = cv_results[column_results].sort_values(
    "mean_test_score", ascending=False
)


def shorten_param(param_name):
    if "__" in param_name:
        return param_name.rsplit("__", 1)[1]
    return param_name


cv_results = cv_results.rename(shorten_param, axis=1)
cv_results
Out[ ]:
l2_regularization learning_rate max_leaf_nodes min_samples_leaf max_bins mean_test_score std_test_score rank_test_score
0 0.000198 0.199053 165 93 12 0.972324 0.004238 1
3 500.934714 0.474991 155 2 23 0.901873 0.004891 2
2 0.00956 1.13531 22 1 35 0.747030 0.039649 3
4 7.588438 0.001835 8 7 30 0.505877 0.008183 4
1 0.000154 1.238161 3 94 2 0.442938 0.026446 5

Visualizing the Best Parameters¶

In [ ]:
import seaborn as sns
import numpy as np

df = pd.DataFrame(
    {
        "max_leaf_nodes": cv_results["max_leaf_nodes"],
        "learning_rate": cv_results["learning_rate"],
        "score_bin": pd.cut(
            cv_results["mean_test_score"], bins=np.linspace(0.5, 1.0, 6)
        ),
    }
)
sns.set_palette("YlGnBu_r")
ax = sns.scatterplot(
    data=df,
    x="max_leaf_nodes",
    y="learning_rate",
    hue="score_bin",
    s=50,
    color="k",
    edgecolor=None,
)
ax.set_xscale("log")
ax.set_yscale("log")

_ = ax.legend(
    title="mean_test_score", loc="center left", bbox_to_anchor=(1, 0.5)
)
No description has been provided for this image
In [ ]:
import numpy as np
import plotly.express as px

# Rest of your code
log10_func = np.vectorize(np.log10)
log2_func = np.vectorize(np.log2)

fig = px.parallel_coordinates(
    cv_results.rename(shorten_param, axis=1).apply({
        "learning_rate": log10_func,
        "max_leaf_nodes": log2_func,
        "min_samples_leaf": log2_func,
        "max_bins": log10_func,
        "l2_regularization": log10_func,
        "mean_test_score": lambda x: x,
    }),
    color="mean_test_score",
    color_continuous_scale=px.colors.sequential.Viridis,
)
fig.show()
In [ ]:
# Deriving Predictions

predictions = Test.copy()
predictions["customer_category"] = model_random_search.predict(Test)

predictions.to_csv("Predictions.csv", index=False)

End Notes¶

This was a fun project, great hands on experience with data analysis and machine learning. I'll soon be writing a blog post on this, I'll link it here when it's done. I tend to start too many things and end up completeing only a few, a lot of things get dragged on for months, but I'm glad I completed this one. Also this was initially supposed to be a dashboarding task, will get a dashboard ready for this as well, might be great to embed it in the blog post, and on my website.

Take a look at my website

also, a huge thanks to INRIA for their MOOC on scikit learn, it was a great help, I've been looking for a solution of a similar problem but I didn't have any luck finding it on kaggle, so welp, feels great to come up with my own solution.

Thanks for reading, hope you enjoyed it, feel free to leave any feedback, I'm always looking to improve.

References¶

  • Scikit Learn MOOC
  • Australian Bureau of Statistics
  • Australian Postcodes
  • Scikit Learn
  • Geopandas
  • Plotly
  • Seaborn
  • Pandas
  • Numpy
  • Matplotlib
  • Folium
  • Geopy
  • Scipy
  • Shapely